
Fast, single-pass K-means algorithms

Fredrik Farnstrom, Computer Science and Engineering, Lund Institute of Technology, Sweden
James Lewis, Computer Science and Engineering, University of California, San Diego

Abstract

We discuss the issue of how well K-means scales to large databases. We evaluate the performance of our implementation of a scalable variant of K-means, from Bradley, Fayyad and Reina (1998b), that uses several fairly complicated types of compression to fit points into a fixed-size buffer, which is then used for the clustering. The running time of the algorithm and the quality of the resulting clustering are measured, for a synthetic dataset and the database from the KDD '98 Cup data mining contest. Lesion studies are performed to see if all types of compression are needed. We find that a simple special case of the algorithm, where all points are represented by their sufficient statistics after being used to update the model once, produces almost as good clusterings as the full scalable K-means algorithm in much less time. Setting all the parameters of the scalable algorithm was difficult, and we were unable to make it cluster as well as the regular K-means operating on the whole dataset.

Introduction

Clustering is a method for grouping together similar data items in a database. Similar data items can be seen as being generated from a probability distribution which is an unobserved feature of the data elements. The clustering problem involves determining the parameters of the probability distributions which generated the observable data elements. The K-means algorithm was developed as a solution to this problem. It is an iterative refinement algorithm that assumes the data points are drawn from a fixed number (K) of clusters. It attempts to determine the parameters of each cluster by making a hard assignment of each data point to one of the K clusters. The algorithm alternates between assigning cluster membership and re-estimating the cluster parameters, and it terminates when the data points no longer change membership under the re-estimated cluster parameters. When a data point changes cluster membership, it can only move to a neighbor of its current cluster. Therefore, only the distances from the data point to the neighboring cluster means need to be checked to determine whether the point will change membership. This insight can make the K-means algorithm more efficient than checking the distance from a point to each of the K cluster means. However, in high dimensions datasets tend to be sparse and the number of neighboring clusters may be large. The time complexity of the algorithm is governed by the access time for each data element, with each element needing to be accessed on every iteration. The K-means algorithm can therefore become inefficient for large databases. A modified version of the algorithm which addresses these inefficiencies is investigated here.

The scalable K-means algorithm

Bradley, Fayyad and Reina (1998b) describe a method of scaling clustering algorithms, in particular K-means, to large datasets. The idea is to use a buffer where points from the database are saved in compressed form. First, the means of the clusters are initialized as in ordinary K-means. Then, all available space in the buffer is filled with points from the database. The current model is updated on the buffer contents in the usual way. The buffer contents are then compressed in two steps.
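Before turning to the two compression steps, the following is a minimal, illustrative sketch of the ordinary K-means iteration that the scalable algorithm runs on the buffer contents (and that the Introduction describes). It is not the authors' C++ implementation; the function and parameter names are our own.

```python
import numpy as np

def kmeans_iterations(points, means, max_iter=100):
    """Ordinary K-means on an in-memory set of points (e.g., the buffer).

    points : (n, d) array of data points
    means  : (K, d) array of initial cluster means
    Returns the updated means and the final hard assignments.
    """
    labels = None
    for _ in range(max_iter):
        # Hard assignment: each point goes to its nearest cluster mean.
        dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Re-estimate each cluster mean from the points assigned to it.
        new_means = means.copy()
        for k in range(means.shape[0]):
            members = points[labels == k]
            if len(members) > 0:
                new_means[k] = members.mean(axis=0)

        if np.allclose(new_means, means):   # memberships stable: converged
            break
        means = new_means
    return means, labels
```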
The first step, called the primary compression, finds points that are close to the cluster they are currently assigned to; these points are unlikely to end up in a different cluster. There are two methods for doing this. The first measures the Mahalanobis distance from a point to the cluster mean it is associated with, and discards the point if it is within a certain radius. For the second method, confidence intervals are set up for the cluster means. Then, for each point, a worst-case scenario is constructed by perturbing the cluster means within their confidence intervals: the mean of the cluster the point is assigned to is moved away from the point, and the means of all other clusters are moved towards the point. If the point is still closest to the same cluster mean after these perturbations, it is unlikely to change cluster membership. Points that are unlikely to change cluster membership are removed from the buffer and placed in one of the discard sets. Each of the main clusters has a discard set, represented by the sufficient statistics of all points belonging to that cluster that have been discarded.
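A minimal sketch of the first primary-compression test is given below, assuming per-cluster diagonal covariances (per-dimension variances) estimated from the sufficient statistics. The helper names, and the radius selection by sorting per-cluster distances to discard a fixed fraction of newly sampled points, follow our reading of the paper's implementation notes; they are not the authors' code.

```python
import numpy as np

def mahalanobis_discard_mask(points, labels, means, variances, radius):
    """Primary compression, method 1 (illustrative sketch).

    Marks points whose Mahalanobis distance to their assigned cluster mean
    (diagonal covariance) is within `radius`; such points can be moved to
    that cluster's discard set.
    """
    mask = np.zeros(len(points), dtype=bool)
    for i, (x, k) in enumerate(zip(points, labels)):
        d2 = np.sum((x - means[k]) ** 2 / variances[k])  # squared Mahalanobis distance
        mask[i] = np.sqrt(d2) <= radius
    return mask

def radius_for_fraction(points, labels, means, variances, k, p):
    """Choose the radius for cluster k so that a fraction p of the newly sampled
    points assigned to k falls inside it, by sorting the distances."""
    members = points[labels == k]
    if len(members) == 0:
        return 0.0
    d = np.sqrt(np.sum((members - means[k]) ** 2 / variances[k], axis=1))
    d_sorted = np.sort(d)
    idx = max(int(p * len(d_sorted)) - 1, 0)
    return d_sorted[idx]
```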

On the rest of the points in the buffer, a secondary K-means clustering is performed, with a larger number of clusters than for the main clustering. The reason for doing this is to try to save buffer space by storing some of these clusters instead of the points. In order to replace points in the buffer with a secondary cluster, the cluster must fulfill a tightness criterion, meaning that its standard deviation in each dimension must be below a certain threshold. Clusters are joined with each other using hierarchical agglomerative clustering, as long as the merged clusters are tight enough. The rest of the points in the buffer are called the retained set. The space in the buffer that has become available because of the compression is now filled with new points, and the whole procedure is repeated. The algorithm ends after one scan of the dataset, or if the main cluster means do not change much as more points are added.

We implemented the algorithm in C++, without reusing any existing code. The platform for our experiments was a dual Pentium II workstation with 256 MB of RAM, running Linux. Our program is not multi-threaded, so only one of the processors is directly used for the experiments. The program is compiled with full optimization turned on. Some details of the implementation of the scalable K-means algorithm are not given by Bradley et al. (1998b). During the primary compression, a Mahalanobis radius that discards a certain fraction of the newly sampled points belonging to a cluster must be determined. Our implementation computes the distance between each newly sampled point and the cluster it is assigned to and, for each cluster, sorts the list. It is then easy to find a radius so that a certain fraction of points is discarded. Note that this approach may change the time complexity of the whole algorithm; it may be possible to implement this more efficiently, at least when the fraction of discarded points is small. For all main and secondary clusters, our implementation stores the sufficient statistics (sum of elements, squared sum of elements, number of points) as well as the mean and standard deviation of that cluster. The mean is stored so that the distance between old and new means (the new mean is computed from the sum of the elements) can be computed when doing K-means clustering, and the standard deviation is stored to speed up the primary compression. This makes each cluster occupy about four times as much space as a point. One problem that may occur during secondary compression is that a tight cluster may not actually save space in the buffer. Therefore, each time a new tight cluster is found (after joining), the space occupied by that cluster is compared to the space occupied by the points it will replace, and the points are compressed only if this reduces the occupied space in the buffer. For our purposes, the sufficient statistics are represented by two vectors, Sum and SumSq, and one integer n. The vectors store the sum and the sum of squares of the elements of the points in the cluster, and the integer keeps track of the number of points in the cluster. From these statistics, the mean and variance along each dimension can be calculated.
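The representation described above lends itself to a small helper structure. The following illustrative sketch (the class, method names and tightness helper are ours, not the paper's C++ code) anticipates the update and merge rules given next, and shows how the mean, variance and tightness criterion follow from Sum, SumSq and n.

```python
import numpy as np

class SufficientStatistics:
    """Sum, SumSq and n for a discard set or a secondary cluster."""

    def __init__(self, dim):
        self.sum = np.zeros(dim)     # per-dimension sum of elements
        self.sumsq = np.zeros(dim)   # per-dimension sum of squared elements
        self.n = 0                   # number of points represented

    def add_point(self, x):
        self.sum += x
        self.sumsq += x ** 2
        self.n += 1

    def merge(self, other):
        self.sum += other.sum
        self.sumsq += other.sumsq
        self.n += other.n

    def mean(self):
        return self.sum / self.n

    def variance(self):
        # E[x^2] - E[x]^2 along each dimension
        return self.sumsq / self.n - self.mean() ** 2

    def is_tight(self, max_std):
        # Tightness criterion: standard deviation below the threshold in every dimension.
        return bool(np.all(np.sqrt(np.maximum(self.variance(), 0.0)) <= max_std))
```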
Let the sufficient statistics of a cluster A be given by (Sum^(A), SumSq^(A), n^(A)). If a point x is added to the cluster, the sufficient statistics are updated according to

Sum^(A) := Sum^(A) + x
SumSq^(A) := SumSq^(A) + x^2
n^(A) := n^(A) + 1

If clusters A and B are merged, the sufficient statistics of the resulting cluster C are

Sum^(C) = Sum^(A) + Sum^(B)
SumSq^(C) = SumSq^(A) + SumSq^(B)
n^(C) = n^(A) + n^(B)

Synthetic Dataset

Synthetic datasets are generated to experiment with the modified K-means algorithms. This allows the cluster means determined by the algorithms to be compared with the known probability distributions of the synthetic data. The data points are drawn from a fixed number of Gaussian distributions. Each Gaussian is assigned a random weight which determines the probability of generating a data point from that distribution. The mean and variance of each Gaussian are sampled uniformly, for each dimension, from the intervals [-5, 5] and [0.7, 1.5] respectively. To determine the accuracy of a clustering, the true cluster means must be compared with the estimated cluster means, which requires discovering which true cluster mean corresponds to which estimated mean. As long as the number of clusters K is small, it is feasible to use the one of the K! permutations that yields the smallest value (sketched in code below). This is what we do, and it is the reason why only a few clusters are used in our experiments.

Lesion Studies

To evaluate the contribution of each data compression method, lesion studies have been conducted. The experiments are run on the 20-dimensional synthetic datasets. Comparisons are made between four variations of the scalable K-means algorithm, the regular K-means algorithm, and a naive scalable K-means algorithm described in the next section. Each method clustered the same synthetic dataset and works within a limited buffer (except for regular K-means), large enough to hold approximately 10% of the data points.
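Returning to the accuracy measure for the synthetic data, here is a minimal sketch of the K!-permutation matching described above. The function name is ours, and we assume the score is the sum of Euclidean distances between paired means under the best permutation, which is one reasonable reading of the quality measure reported in Figure 1.

```python
import itertools
import numpy as np

def best_permutation_distance(true_means, estimated_means):
    """Sum of distances between true and estimated means, minimized over all
    K! assignments of estimated means to true means (feasible for small K)."""
    K = len(true_means)
    best = np.inf
    for perm in itertools.permutations(range(K)):
        total = sum(np.linalg.norm(true_means[i] - estimated_means[perm[i]])
                    for i in range(K))
        best = min(best, total)
    return best
```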

The number of clusters is fixed at five for these lesion studies. The four variations of the scalable K-means algorithm are built by adding each of the data compression methods to the previous variation. The first variation does not use any of the data compression techniques proposed in the scalable K-means algorithm: the K-means algorithm runs until convergence on the first fill of the buffer, and no points are added or removed after this initial fill. This variation is equivalent to clustering a 10% sample of the data points. For the next variation, the first primary compression technique is used; this involves moving to the discard set those points within a certain Mahalanobis distance of their associated cluster mean. Next, the second primary compression technique is added: confidence intervals are used to discard data points that are unlikely to change cluster membership. The fourth variation includes the secondary data compression technique of determining secondary clusters. For secondary compression the number of secondary clusters is fixed at 25. All other constants used in these experiments are shown in Table 1.

Parameter                               Value
Confidence level for cluster means      95%
Max std. dev. for tight clusters        1.1
Number of secondary clusters            25
Fraction of points discarded (p)        10%

Table 1: The parameters used for the lesion studies of the scalable K-means algorithm.

Naive scalable K-means

A special case of the scalable K-means algorithm is obtained when all points in the buffer are discarded each time. The algorithm (sketched in code further below) is:

1. Randomly initialize the cluster means. Let each cluster have a discard set in the buffer; the discard set keeps track of the sufficient statistics of all points from previous iterations.
2. Fill the buffer with points.
3. Perform K-means iterations on the points and discard sets in the buffer, until convergence. For this clustering, the discard sets are treated like regular points placed at the means of the discard sets, but weighted with the number of points in the discard sets.
4. For each cluster, update the sufficient statistics of the discard set with the points assigned to the cluster. Remove all points from the buffer.
5. If the database is empty, end. Otherwise, repeat from step 2.

This algorithm will subsequently be called naive scalable K-means. Like the scalable algorithm, it performs only one scan over the dataset and works in a fixed-size buffer, but it requires much less computation each time the buffer is filled, and the whole buffer can be filled with new points for each iteration.

Results

The experiments ran on 100 different synthetic datasets; the results are shown in Figure 1 and Figure 2. For each dataset, five different clustering models are generated from different initial starting positions, and the best model is retained for the comparison of clustering accuracy. Some of the synthetic datasets could not be effectively clustered by any of the methods used in the experiments; these datasets distort the quality measurements and are removed before comparing the methods. We believe that using a larger number of initial starting positions may reduce the number of datasets which do not converge to a solution. Of the 100 synthetic datasets, 24 are not used. To compare the running times, all 100 trials are used; the initial starting positions are combined to give a cumulative running time for each clustering method. The results of the lesion studies show no increase in accuracy for identifying the clusters as each data compression technique is added, but a significant increase in running time.
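For concreteness, here is an illustrative sketch of the naive scalable K-means loop from the previous section. The buffer-reading interface (`read_chunk`) and all names are our own assumptions, not the paper's implementation; discard sets are kept as per-cluster sums and counts.

```python
import numpy as np

def weighted_kmeans(points, weights, means, max_iter=100):
    """K-means where each row of `points` carries a weight (regular points have
    weight 1; each discard set appears as its mean weighted by its point count)."""
    labels = None
    for _ in range(max_iter):
        d = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_means = means.copy()
        for k in range(len(means)):
            w = weights[labels == k]
            if w.sum() > 0:
                new_means[k] = np.average(points[labels == k], axis=0, weights=w)
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

def naive_scalable_kmeans(read_chunk, K, dim, seed=0):
    """`read_chunk()` is an assumed interface that returns successive buffer-sized
    (m, dim) arrays read from disk, and None when the database is exhausted."""
    rng = np.random.default_rng(seed)
    means = rng.normal(size=(K, dim))      # step 1: random initial cluster means
    ds_sum = np.zeros((K, dim))            # discard-set sufficient statistics (Sum)
    ds_n = np.zeros(K)                     # discard-set sufficient statistics (n)
    while True:
        chunk = read_chunk()               # step 2: fill the buffer with points
        if chunk is None:
            return means                   # step 5: database empty, stop
        # Step 3: cluster buffer points together with the weighted discard sets.
        ds_means = np.where(ds_n[:, None] > 0,
                            ds_sum / np.maximum(ds_n, 1)[:, None], means)
        pts = np.vstack([chunk, ds_means])
        wts = np.concatenate([np.ones(len(chunk)), ds_n])
        means, labels = weighted_kmeans(pts, wts, means)
        # Step 4: fold every buffer point into its cluster's discard set, empty the buffer.
        for x, k in zip(chunk, labels[:len(chunk)]):
            ds_sum[k] += x
            ds_n[k] += 1
```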
Except for the first scalable K-means variation, each method used all of the points in the datasets; none of the experiments produced a model which converged using less than the full dataset. The cluster quality is about the same for each of the methods that used the whole dataset. The running time is directly related to the complexity of the algorithms: each additional data compression technique adds to the running time. The regular K-means algorithm converges to a solution faster than any of the scalable K-means variations that used data compression. The naive approach runs faster than the other methods while producing the same level of accuracy. It should be noted that the variation using the secondary compression method found no clusters tight enough to be merged as secondary clusters. The inability of the lesion studies to show any significant differences between the quality of the clusterings may be a result of using the synthetic datasets.

KDD '98 Dataset

To experiment with a real-world database that is sufficiently large for evaluating the scalability of the proposed algorithm, the dataset from the 1998 KDD (Knowledge Discovery and Data Mining) Cup data mining contest is used. This database contains statistical information about people who have made charitable donations in response to direct mailing requests. Clustering can be used to identify groups of donors who may require specialized requests in order to maximize donation profits. Each record in the database has 481 statistical fields. To test the clustering algorithms, we selected 56 features from these fields. The features are represented as vectors of floating-point numbers. Numerical data (e.g. donation amounts, income, age) are stored as floating-point numbers. Date values (e.g. donation dates, date of birth) are stored as the number of months from a fixed minimum date.
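A small, hypothetical sketch of the kind of per-field encoding described above (dates as months from a fixed minimum date, binary categories as 0.0/1.0 floats). The field names and the choice of minimum date are illustrative assumptions, not taken from the paper.

```python
from datetime import date

MIN_DATE = date(1900, 1, 1)   # assumed reference date, for illustration only

def months_since_min(d):
    """Encode a date as the number of months from the fixed minimum date."""
    return (d.year - MIN_DATE.year) * 12 + (d.month - MIN_DATE.month)

def encode_record(donation_amount, last_donation, is_high_income):
    """Turn one (hypothetical) record into a floating-point feature vector."""
    return [float(donation_amount),
            float(months_since_min(last_donation)),
            1.0 if is_high_income else 0.0]

# Example: encode_record(25.0, date(1997, 6, 1), True) -> [25.0, 1169.0, 1.0]
```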

[Figure 1 shows two panels, "Cluster quality" (distance between estimated and true means) and "Running time" (seconds), for the algorithms R10, the three S10 variants, N10 and K.]

Figure 1: The graph shows the sum of the distances between the estimated and true cluster means, for a synthetic dataset with five clusters. The algorithms are random-sampling K-means with no compression (R10), scalable K-means with primary-1 compression only (S10--), with primary-1 and primary-2 compression (S10-), with all types of compression (S10), the naive scalable K-means (N10), and the regular K-means operating on the whole dataset (K). Error bars show the standard errors.

Binary features, such as belonging to a particular income category, are also stored as floating-point numbers. Of the 56 features that are used, 18 are binary. To give equal weight to each feature, each of the fields was normalized to have a mean of zero and a variance of one (sketched in code at the end of this section). The records in the database are converted to this format and saved to a separate file, which is used to retrieve data points during the clustering process. The purpose of this experiment is to compare the running time and cluster quality of the regular K-means, operating on the whole dataset or on resamples, the scalable K-means algorithm using all types of compression, and the naive scalable K-means where every point is put in the discard set after being used to update the model. The experiment is performed with resamples and buffers of 10% and 1% of the size of the whole dataset. The number of clusters is 10. First, the dataset is randomly reordered. Then it is clustered five times by all algorithms, each time with new randomly chosen initial conditions (but with all algorithms starting from the same initial conditions). The running time is measured. The quality of the clustering is measured by the sum of the squared distances between each point and the cluster mean it is associated with. Of these five clusterings, the one with the best quality is used, because K-means is known to be sensitive to initial conditions. This whole procedure is repeated 14 times with different random orderings of the dataset.

Figure 2: The graph shows the average running times for the modified versions of the scalable K-means algorithm. Error bars show the standard errors.

Parameter                               Value
Confidence level for cluster means      95%
Max std. dev. for tight clusters        1.1
Number of secondary clusters            40
Fraction of points discarded (p)        20%

Table 2: The parameters used for the scalable K-means algorithm.

For the scalable clustering, we found it hard to set the parameters, especially for the secondary clustering. If the number of secondary clusters is low, many points are assigned to each cluster, which means that in order to get any tight clusters at all, the maximum allowed standard deviation must be quite high. If the number of secondary clusters is too high, on the other hand, it takes too much time to run the secondary K-means clustering and merge the resulting clusters, with only slightly better clusterings as a result. The parameter values we used are given in Table 2. Figure 3 shows the average quality of the best of the five clusterings, and Figure 4 shows the average running time of all the clusterings. From Figure 3 it can be seen that the quality of the random-sampling K-means operating on a 1% resample is much worse than everything else, and that ordinary K-means produces the best clusterings, as can be expected. Larger buffers produce better clusterings.
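Two small illustrative helpers for the preprocessing and quality measure used in this experiment: z-score normalization of the feature fields, and the distortion measure (sum of squared distances from each point to its assigned cluster mean). The function names are our own.

```python
import numpy as np

def zscore_normalize(X):
    """Give each feature zero mean and unit variance, as done for the KDD fields."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant columns
    return (X - mean) / std

def distortion(X, means):
    """Clustering quality: sum of squared distances from each point to its
    nearest (i.e. assigned) cluster mean; lower is better."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()
```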
The scalable K-means produces slightly better clusterings than the naive scalable K-means, but the difference is not significant. The running time is by far the highest for the ordinary K-means, but the scalable K-means also takes much time compared to the naive scalable and random-sampling K-means.

[Figure 3 shows two panels, "Cluster distortion" (squared point-to-cluster distance) and "Running time" (seconds), for the algorithms S10, S1, N10, N1, R10, R1 and K.]

Figure 3: The graph shows the sum of the squared distances between each point in the dataset and the cluster mean it is associated with, on the KDD Cup '98 dataset with 10 clusters. The algorithms are the scalable K-means (S10 and S1), the "naive" scalable K-means (N10 and N1), random-sampling K-means (R10 and R1), and the ordinary K-means working on the whole dataset (K). Names ending in 10 use a buffer or resample whose size is 10% of the whole dataset; names ending in 1 use a 1% buffer or resample. Error bars show the standard errors.

Complexity

Performing one K-means iteration over a set of n entities has time complexity O(n). Here, it is assumed that O(log n) iterations over the set are needed for the clusters to converge. Table 3 shows the time, space and disk I/O complexity of regular, scalable and naive scalable K-means, where m is the size of the RAM buffer. The O(n log m) time complexity of the scalable K-means algorithm is caused by the K-means iterations over the buffer, the sorting on Mahalanobis distance in our implementation, and the secondary K-means clustering. Our experiments show that the constant in the time complexity of the scalable K-means is large compared to the constant for regular K-means, and that the gain from using the scalable method comes from reducing the number of disk accesses. This can be seen from the running times on the synthetic and real-world datasets: for the synthetic datasets, regular K-means runs faster than scalable K-means, but for the real-world dataset the scalable K-means is faster.

                        Time          Space    Disk I/O
Regular K-means         O(n log n)    O(1)     O(n log n)
Scalable K-means        O(n log m)    O(m)     O(n)
Naive K-means           O(n log m)    O(m)     O(n)

Table 3: The table shows the time, RAM and disk complexity for three different K-means variations.

Figure 4: This figure shows the average running time for performing one clustering, for the same runs as the ones shown in Figure 3. Error bars show the standard errors.

Conclusions

We have implemented a scalable K-means algorithm and performed lesion studies on it. A special, but much simplified, case of the scalable K-means algorithm is obtained when all points are discarded after being used to update the model. This variant has been shown to produce good clusterings of large synthetic and real-world datasets, in much less time than the full scalable K-means algorithm. We have also shown that the scalable K-means does not produce as good clusterings as the regular K-means operating on the full dataset, even if the parameters are set to give it a running time of about 40% of the full clustering. It may be that we did not find the right parameters, but we argue that an algorithm which has many parameters, without any guidelines for how to set them, is probably not worth the extra effort of implementing. One further point about the scalable K-means algorithm is worth noting: it is possible to update several models, each starting from different initial conditions, sharing the same buffer and performing only one scan of the database. We did not use this in our experiments, so it is possible that the running time would have decreased if we had. Also, the scalable algorithm is not restricted to K-means clustering, but can be extended to work with more expensive clustering methods, such as expectation maximization (EM) (Bradley, Fayyad, & Reina 1998a).
Our main conclusion is that a complicated scalable K-means algorithm is only worthwhile if the database is very large or the time it takes to access a record is high, and a high-quality clustering is desirable.

References

Bradley, P. S.; Fayyad, U.; and Reina, C. 1998a. Scaling EM (expectation-maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research, Redmond, WA.

Bradley, P. S.; Fayyad, U.; and Reina, C. 1998b. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 9–15. AAAI Press.
