Parameterless Data Compression and Noise Filtering Using Association Rule Mining


Yew-Kwong Woon 1, Xiang Li 2, Wee-Keong Ng 1, and Wen-Feng Lu 2,3

1 Nanyang Technological University, Nanyang Avenue, Singapore, SINGAPORE
2 Singapore Institute of Manufacturing Technology, 71 Nanyang Drive, Singapore, SINGAPORE
3 Singapore-MIT Alliance

Abstract. The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based, grid-based or resolution-based methods to handle noise, but they require the fine-tuning of complex parameters. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. There are several noise/outlier detection techniques, but they too need suitable parameters. In this paper, we present a novel parameterless method of filtering noise using ideas borrowed from association rule mining. We term our technique FLUID (Filtering Using Itemset Discovery). FLUID automatically discovers representative points in the dataset without any input parameter by mapping the dataset into a form suitable for frequent itemset discovery. After frequent itemsets are discovered, they are mapped back to their original form and become representative points of the original dataset. As such, FLUID accomplishes both data and noise reduction simultaneously, making it an ideal preprocessing step for cluster analysis. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID.
1 Introduction

The information age was hastily ushered in by the birth of the World Wide Web (Web). All of a sudden, an abundance of information, in the form of web pages and digital libraries, was available at the fingertips of anyone connected to the Web. Researchers from the Online Computer Library Center found that there were 7 million unique sites in the year 2000, and the Web was predicted to continue its fast expansion [1]. Data mining has become important because traditional statistical techniques are no longer able to handle such immense data. Cluster analysis, or clustering, has become the data mining technique of choice because of its ability to function with little human supervision. Clustering is the process of grouping a set of physical/abstract objects

into classes of similar objects. It has been found useful for a wide variety of applications such as web usage mining [2], manufacturing [3], personalization of web pages [4] and digital libraries [5]. Researchers have begun to analyze traditional clustering techniques in an attempt to adapt them to current needs. One such technique is the classic k-means algorithm [6]. It is fast but very sensitive to the parameter k and to noise. Recent clustering techniques that attempt to handle noise more effectively include density-based techniques [7], grid-based techniques [8] and resolution-based techniques [9, 10]. However, all of them require the fine-tuning of complex parameters to remove the adverse effects of noise. Empirical studies show that many adjustments need to be made and an optimal solution is not always guaranteed [10]. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. Since most data, such as digital library documents, web logs and manufacturing specifications, have many features or dimensions, this shortcoming is unacceptable. There are also several works on outlier/noise detection, but they too require the setting of non-intuitive parameters [11, 12]. In this paper, we present a novel unsupervised method of filtering noise using ideas borrowed from association rule mining (ARM) [13]. We term our technique FLUID (FiLtering Using Itemset Discovery). FLUID first maps the dataset into a set of items using binning. Next, ARM is applied to discover frequent itemsets. As there has been sustained intense interest in ARM since its conception in 1993, ARM algorithms have improved by leaps and bounds. Any ARM algorithm can be used by FLUID, which allows it to leverage the efficiency of the latest ARM methods. After frequent itemsets are found, they are mapped back to become representative points of the original dataset.
This capability of FLUID not only eliminates the problematic need for noise removal in existing clustering algorithms but also improves their efficiency and scalability because the size of the dataset is significantly reduced. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID. The rest of the paper is organized as follows. The next section reviews related work in the areas of clustering, outlier detection and ARM, while Section 3 presents the FLUID algorithm. Experiments are conducted to assess the feasibility of FLUID in Section 4. Finally, the paper is concluded in Section 5.

2 Related Work

In this section, we review prominent works in the areas of clustering and outlier detection. The problem of ARM and its representative algorithms are discussed as well.

2.1 Clustering and Outlier Detection

The k-means algorithm is the pioneering algorithm in clustering [6]. It begins by randomly generating k cluster centers known as centroids. Objects are iteratively

assigned to the cluster whose centroid is nearest. It is fast but sensitive to the parameter k and to noise. Density-based methods are more noise-resistant and are based on the notion that dense regions are interesting regions. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the pioneering density-based technique [7]. It uses two input parameters to define what constitutes the neighborhood of an object and whether its neighborhood is dense enough to be considered. Grid-based techniques can also handle noise. They partition the search space into a number of cells/units and perform clustering on such units. CLIQUE (CLustering In QUEst) considers a unit to be dense if the number of objects in it exceeds a density threshold and uses an apriori-like technique to iteratively derive higher-dimensional dense units [8]. CLIQUE requires the user to specify a density threshold and the size of grids. Recently, resolution-based techniques have been proposed and applied successfully to noisy datasets. The basic idea is that when viewed at different resolutions, the dataset reveals different clusters, and by visualization or change detection of certain statistics, the correct resolution at which noise is minimal can be chosen. WaveCluster is a resolution-based algorithm that uses wavelet transformation to distinguish clusters from noise [9]. Users must first determine the best quantization scheme for the dataset and then decide on the number of times to apply the wavelet transform. The TURN* algorithm is another recent resolution-based algorithm [10]. It iteratively scales the data to various resolutions. To determine the ideal resolution, it uses the third differential of the series of cluster feature statistics to detect an abrupt change in the trend. However, it is unclear how certain parameters such as the closeness threshold and the step size of resolution scaling are chosen.
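To make the parameter burden of density-based methods concrete, the following minimal Python sketch shows DBSCAN-style core-point testing (this is only the neighborhood-density test, not the full algorithm; the function and parameter names are ours, with eps and min_pts standing for the two input parameters described above):

```python
def dense_enough(points, p, eps, min_pts):
    """DBSCAN-style core-point test: the neighborhood of p is dense
    enough if at least min_pts points lie within distance eps of p
    (p itself included, as in the usual formulation)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(dist(p, q) <= eps for q in points) >= min_pts
```

Choosing eps and min_pts well requires prior knowledge of the data's density distribution, which is exactly the tuning burden FLUID aims to avoid.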
Outlier detection is another means of tackling noise. One classic notion is that of DB (Distance-Based) outliers [11]. An object is considered a DB-outlier if a certain fraction f of the dataset lies at a distance greater than D from it. A recent enhancement involves the concept of k-nearest neighbors [12]: the top n points with the largest D_k (the distance to a point's k-th nearest neighbor) are treated as outliers. The parameters f, D, k and n must be supplied by the user. In summary, there is currently no ideal solution to the problem of noise: existing clustering algorithms require much parameter tweaking, which becomes difficult for high-dimensional datasets. Even if their parameters can somehow be optimally set for a particular dataset, there is no guarantee that the same settings will work for other datasets. The problem is similar in the area of outlier detection.

2.2 Association Rule Mining

Since the concept of ARM is central to FLUID, we formally define ARM and then survey existing ARM algorithms in this section. A formal description of ARM is as follows. Let the universal itemset I = {a_1, a_2, ..., a_U} be a set of literals called items. Let D_t be a database of transactions, where each transaction T contains a set of items such that T ⊆ I. A j-itemset is a set of j unique items.

For a given itemset X ⊆ I and a given transaction T, T contains X if and only if X ⊆ T. Let ψ_X be the support count of an itemset X, which is the number of transactions in D_t that contain X. Let s be the support threshold and |D_t| be the number of transactions in D_t. An itemset X is frequent if ψ_X ≥ |D_t| × s%. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in D_t with confidence c% if no less than c% of the transactions in D_t that contain X also contain Y. The association rule X ⇒ Y has support s% in D_t if ψ_{X∪Y} ≥ |D_t| × s%. The problem of mining association rules is to discover rules that have confidence and support greater than the thresholds. It consists of two main tasks: the discovery of frequent itemsets and the generation of association rules from frequent itemsets. Researchers usually tackle only the first task because it is more computationally expensive. Hence, current algorithms are designed to efficiently discover frequent itemsets. We will leverage the ability of ARM algorithms to rapidly discover frequent itemsets in FLUID. Introduced in 1994, the Apriori algorithm was the first successful algorithm for mining association rules [13]. Since its introduction, it has popularized ARM. It introduces a method to generate candidate itemsets in a pass using only the frequent itemsets from the previous pass. The idea, known as the apriori property, rests on the fact that any subset of a frequent itemset must be frequent as well. The FP-growth (Frequent Pattern-growth) algorithm is a more recent ARM algorithm that achieves impressive results by removing the need to generate candidate itemsets, which is the main bottleneck in Apriori [14]. It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent itemsets. This compact structure also removes the need for multiple database scans: it is constructed using only two scans.
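The definitions above can be illustrated with a toy levelwise miner in Python (a didactic sketch, not the optimized Apriori of [13]; here s is given as a fraction rather than a percentage). It counts supports and uses the apriori property to prune candidates:

```python
from itertools import combinations

def frequent_itemsets(transactions, s):
    """Return {itemset: support_count} for every itemset X with
    psi_X >= |D_t| * s, mined levelwise with apriori pruning."""
    n = len(transactions)
    min_count = n * s
    transactions = [frozenset(t) for t in transactions]
    # support counts of 1-itemsets
    counts = {}
    for t in transactions:
        for a in t:
            key = frozenset([a])
            counts[key] = counts.get(key, 0) + 1
    frequent = {X: c for X, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # candidate k-itemsets: extend frequent (k-1)-itemsets; the apriori
        # property lets us require that every (k-1)-subset is frequent
        candidates = set()
        items = sorted({a for X in frequent for a in X})
        for X in frequent:
            for a in items:
                if a not in X:
                    c = X | {a}
                    if all(frozenset(sub) in frequent
                           for sub in combinations(c, k - 1)):
                        candidates.add(frozenset(c))
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {X: cnt for X, cnt in counts.items() if cnt >= min_count}
        result.update(frequent)
        k += 1
    return result
```

For example, with five transactions over items {a, b, c} and s = 0.4, an itemset is frequent when it appears in at least two transactions.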
The items in the transactions are first sorted and then used to construct the FP-tree. Next, FP-growth proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets. Recently, we presented a novel trie-based data structure known as the Support-Ordered Trie ITemset (SOTrieIT) to store support counts of 1-itemsets and 2-itemsets [15, 16]. The SOTrieIT is designed to be used efficiently by our algorithm FOLDARM (Fast OnLine Dynamic Association Rule Mining) [16]. In our recent work on ARM, we propose a new algorithm, FOLD-growth (Fast OnLine Dynamic-growth), which is an optimized hybrid version of FOLDARM and FP-growth [17]. FOLD-growth first builds a set of SOTrieITs from the database and uses them to prune the database before building FP-trees. FOLD-growth is shown to outperform FP-growth by up to two orders of magnitude.

3 Filtering Using Itemset Discovery (FLUID)

3.1 Algorithm

Given a d-dimensional dataset D_o consisting of n objects o_1, o_2, ..., o_n, FLUID discovers a set of representative objects O_1, O_2, ..., O_m, where m ≪ n, in three main steps:

1. Convert dataset D_o into a transactional database D_t using procedure MapDB
2. Mine D_t for frequent itemsets using procedure MineDB
3. Convert the discovered frequent itemsets back to their original object form using procedure MapItemset

Procedure MapDB
1. Sort each dimension of D_o in ascending order
2. Compute the mean µ_x and standard deviation σ_x of the nearest-object distance in each dimension x by checking the left and right neighbors of each object
3. Find the range of values r_x for each dimension x
4. Compute the number of bins β_x for each dimension x: β_x = r_x / ((µ_x + 3σ_x) n)
5. Map each bin to a unique item a ∈ I
6. Convert each object o_i in D_o into a transaction T_i with exactly d items by binning its feature values, yielding a transactional database D_t

Procedure MapDB tries to discretize the features of dataset D_o in a way that minimizes the number of required bins without losing the pertinent structural information of D_o. Every dimension has its own distribution of values and thus it is necessary to compute the bin sizes of each dimension/feature separately. Discretization is itself a massive area, but experiments reveal that MapDB is good enough to remove noise efficiently and effectively. To understand the data distribution in each dimension, the mean and standard deviation of the closest-neighbor distance of every object in every dimension are computed. Assuming that each dimension follows a Normal distribution, an object should have one neighboring object within three standard deviations of the mean nearest-neighbor distance. To avoid having too many bins, there is a need to ensure that each bin contains a certain number of objects (0.5% of the dataset size), and this is accomplished in step 4. In the event that the values are spread out too widely, i.e. the standard deviation is much larger than the mean, the number of standard deviations used in step 4 is reduced from 3 to 1.
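A minimal Python sketch of procedure MapDB follows. It is only one possible interpretation of the steps above (we take µ_x + 3σ_x as the bin width in dimension x, assume n ≥ 2, and the function name and item-numbering scheme are ours):

```python
def map_db(data):
    """Discretize a list of d-dimensional points into transactions.
    Each (dimension, bin) pair gets a globally unique item id, so every
    object becomes a transaction with exactly d items."""
    n, d = len(data), len(data[0])
    transactions = [set() for _ in range(n)]
    offset = 0  # running base for item ids, one block per dimension
    for x in range(d):
        vals = sorted(p[x] for p in data)                       # step 1
        gaps = [b - a for a, b in zip(vals, vals[1:])]
        # nearest-object distance: min of left/right gap (step 2);
        # endpoints only have one neighbor, hence the infinities
        nearest = [min(l, r) for l, r in
                   zip([float('inf')] + gaps, gaps + [float('inf')])]
        mu = sum(nearest) / n
        sigma = (sum((v - mu) ** 2 for v in nearest) / n) ** 0.5
        lo, hi = vals[0], vals[-1]                              # step 3
        width = mu + 3 * sigma                                  # step 4
        n_bins = max(1, int((hi - lo) / width) + 1) if width > 0 else 1
        for i, p in enumerate(data):                            # steps 5-6
            b = min(int((p[x] - lo) / width), n_bins - 1) if width > 0 else 0
            transactions[i].add(offset + b)
        offset += n_bins
    return transactions
```

Objects that fall into the same bin in every dimension yield identical transactions, which is what lets frequent itemset discovery find dense regions later.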
Note that if a particular dimension has fewer than 100 unique values, steps 2-4 are unnecessary and the number of bins is simply the number of unique values. As mentioned in step 6, each object becomes a transaction with exactly d items because each item represents one feature of the object. The transactions do not have duplicate items because every feature has its own unique set of bins. Once D_o is mapped into transactions with unique items, it is in a form that can be mined by any association rule mining algorithm.

Procedure MineDB
1. Set support threshold s = 0.1 (10%)
2. Set the number of required frequent d-itemsets k = n
3. Let δ(A,B) be the distance between two j-itemsets A = (a_1, ..., a_j) and B = (b_1, ..., b_j): δ(A,B) = sqrt(Σ_{i=1..j} (a_i - b_i)^2)
4. An itemset A is a loner itemset if δ(A,Z) > 1 for all Z ∈ L, Z ≠ A
5. Repeat
6.   Repeat
7.     Use an association rule mining algorithm to discover a set of frequent itemsets L from D_t
8.     Remove itemsets with fewer than d items from L
9.     Adjust s using a variable step size to bring |L| closer to k
10.   Until |L| = k or |L| stabilizes
11.   Set k = |L| / 2
12.   Set s = 0.1
13.   Remove loner itemsets from L
14. Until an abrupt change in the number of loner itemsets occurs

MineDB is the most time-consuming and complex step of FLUID. The key idea is to discover the optimal set of frequent itemsets that represents the important characteristics of the original dataset; we consider important characteristics to be dense regions in the original dataset. The support threshold s is akin to the density threshold used by density-based clustering algorithms and can thus be used to remove regions of low density (itemsets with low support counts). The crucial point is how to automate the fine-tuning of s. This is done by checking the number of loner itemsets after each iteration (steps 6-14). Loner itemsets represent points with no neighboring points in the discretized d-dimensional feature space. Therefore, an abrupt change in the number of loner itemsets indicates that the support threshold has been reduced to a point where dense regions in the original dataset are being divided into too many sparse regions. This point is made more evident in Section 4, where its effect can be visually observed. The number of desired frequent d-itemsets (frequent itemsets with exactly d items), k, is initially set to the size of the original dataset, as seen in step 2. The goal is to obtain the finest resolution of the dataset that is attainable after its transformation.
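Steps 3-4 of MineDB can be sketched directly in Python, representing each frequent d-itemset as a d-tuple of bin indices (a hypothetical helper of ours, not the authors' code):

```python
def delta(A, B):
    """Euclidean distance between two j-itemsets of bin indices (step 3)."""
    return sum((a - b) ** 2 for a, b in zip(A, B)) ** 0.5

def count_loners(L):
    """Count loner itemsets: those with no neighbor within distance 1
    of themselves in L (step 4)."""
    return sum(all(delta(A, Z) > 1 for Z in L if Z != A) for A in L)
```

Tracking count_loners(L) across iterations gives the statistic whose abrupt change terminates the outer loop.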
The algorithm then proceeds to derive coarser resolutions in an exponential fashion in order to quickly discover a good representation of the original dataset. This is done at step 11, where k is reduced to half of |L|. The amount of reduction can certainly be lowered to obtain more resolutions, but this would incur longer processing time and may not be necessary. Experiments have revealed that our choice suffices for a good approximation of the representative points of a dataset. In step 8, notice that itemsets with fewer than d items are removed. This is because association rule mining discovers frequent itemsets of various sizes, but we are only interested in frequent itemsets containing items that represent all the features of the dataset. In step 9, the support threshold s is incremented or decremented by a variable step size. The step size is variable as it must

be made smaller in order to zoom in on the best possible s for obtaining the required number of frequent d-itemsets, k. In most situations, it is quite unlikely that |L| can be adjusted to equal k exactly; thus, if |L| stabilizes or fluctuates between similar values, its closest approximation to k is considered the best solution, as seen in step 10.

Procedure MapItemset
1. for each frequent itemset A ∈ L do
2.   for each item i ∈ A do
3.     Assign the center of the bin represented by i as its new value
4.   end for
5. end for

The final step of FLUID is the simplest: it maps the frequent itemsets back to their original object form. The filtered dataset now contains representative points of the original dataset, excluding most of the noise. Note that the filtering is only an approximation, but it is sufficient to remove most of the noise in the data and retain the pertinent structural characteristics of the data. Subsequent data mining tasks such as clustering can then extract knowledge from the filtered and compressed dataset efficiently, with little complication from noise. Note also that the types of clusters discovered depend mainly on the clustering algorithm used and not on FLUID.

3.2 Complexity Analysis

The time complexities of the three main steps of FLUID are as follows:
1. MapDB: The main processing time is taken by the sorting in step 1; hence its time complexity is O(n log n).
2. MineDB: As the total number of iterations used by the loops in the procedure is very small, the bulk of the processing time is attributed to the time to perform association rule mining, T_A.
3. MapItemset: The processing time depends on the number of resultant representative points |L|; thus it has a time complexity of O(n).
Hence, the overall time complexity of FLUID is O(n log n + T_A + n).

3.3 Strengths and Weaknesses

The main strength of FLUID is its independence from user-supplied parameters. Unlike its predecessors, FLUID does not require any human supervision.
Not only does it remove noise/outliers, it also compresses the dataset into a set of representative points without any loss of pertinent structural information from the original dataset. In addition, it is reasonably scalable with respect to both the size and

dimensionality of the dataset, as it inherits the efficient characteristics of existing association rule mining algorithms. Hence, it is an attractive preprocessing tool for clustering and other data mining tasks. Ironically, its weakness also stems from its use of association rule mining techniques: association rule mining algorithms do not scale as well as resolution-based algorithms in terms of dataset dimensionality. Fortunately, since ARM is still receiving much attention from the research community, more efficient ARM algorithms are likely to become available to FLUID. Another weakness is that FLUID spends much redundant processing time finding and storing frequent itemsets that have fewer than d items. This problem is inherent in association rule mining because larger frequent itemsets are usually formed from smaller frequent itemsets. Efficiency and scalability could be improved greatly if there were a way to directly discover frequent d-itemsets.

4 Experiments

This section evaluates the viability of FLUID through experiments conducted on a Pentium-4 machine with a CPU clock rate of 2 GHz and 1 GB of main memory. We use FOLD-growth as our ARM algorithm as it is fast, incremental and scalable [17]. All algorithms are implemented in Java.

Fig. 1. (a)-(d): Results of executing FLUID on a synthetic dataset.

The synthetic dataset used here (t7.10k.dat) tests the ability of FLUID to discover clusters of various sizes and shapes amidst much noise; it has been used as a benchmarking test for several clustering algorithms [10]. It has been shown that prominent algorithms like k-means [6], DBSCAN [7], CHAMELEON [18] and WaveCluster [9] are unable to properly find the nine visually obvious clusters and remove noise even with exhaustive parameter adjustments [10]. Only TURN* [10] manages to find the correct clusters, but it requires user-supplied parameters, as mentioned in Section 2.1. Figure 1(a) shows the dataset with 10,000 points in nine arbitrarily-shaped clusters interspersed with random noise. Figure 1 shows the results of running FLUID on the dataset. FLUID stops at the iteration where Figure 1(c) is obtained, but we show the remaining results to illustrate the effect of loner itemsets. It is clear that Figure 1(c) is the optimal result, as most of the noise is removed while the nine clusters remain intact. Figure 1(d) loses much of the pertinent information of the dataset. The numbers of loner itemsets for Figures 1(b), (c) and (d) are 155, 55 and 136 respectively. Figure 1(b) has the most loner itemsets because of the presence of noise in the original dataset. It is the finest representation of the dataset in terms of resolution. There is a sharp drop in the number of loner itemsets in Figure 1(c), followed by a sharp increase in Figure 1(d). The sharp drop is explained by the fact that most noise is removed, leaving behind objects that are closely grouped together. In contrast, the sharp increase in loner itemsets is caused by too low a support threshold. This means that only very dense regions are captured, which causes the disintegration of the nine clusters as seen in Figure 1(d). Hence, a change in the trend of the number of loner itemsets indicates that the structural characteristics of the dataset have changed.
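One simple way to read the drop-then-rise pattern in the loner counts (155, 55, 136 above) is as the first local minimum of the sequence; the sketch below is our interpretation of the "abrupt change" stopping test, not the authors' exact criterion:

```python
def best_iteration(loner_counts):
    """Return the index of the first local minimum in the sequence of
    loner-itemset counts across iterations. The drop marks noise being
    removed; the subsequent rise marks clusters disintegrating, so the
    minimum is the resolution to keep. Returns None if no reversal."""
    for i in range(1, len(loner_counts) - 1):
        if loner_counts[i - 1] > loner_counts[i] < loner_counts[i + 1]:
            return i
    return None
```

On the counts reported for Figures 1(b)-(d), this picks the iteration corresponding to Figure 1(c).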
FLUID took a mere 6 s to compress the dataset into 1,650 representative points with much of the noise removed. The dataset is reduced by more than 80% without affecting its inherent structure; that is, the shapes of its nine clusters are retained. This experiment therefore shows that FLUID can filter away noise, even in a noisy dataset with sophisticated clusters, without any user parameters and with impressive efficiency.

5 Conclusions

Clustering is an important data mining task, especially in our information age where raw data is abundant. Several existing clustering methods cannot handle noise effectively because they require the user to set complex parameters properly. We propose FLUID, a noise-filtering and parameterless algorithm based on association rule mining, to overcome the problem of noise as well as to compress the dataset. Experiments on a benchmarking synthetic dataset show the effectiveness of our approach. In future work, we will improve our approach, provide rigorous proofs for it, and design a clustering algorithm that integrates efficiently with FLUID. In addition, the problem of handling high-dimensional datasets will be addressed. Finally, more experiments involving larger datasets with more dimensions will be conducted to affirm the practicality of FLUID.

References

1. Dean, N., ed.: OCLC Researchers Measure the World Wide Web. Number 248, Online Computer Library Center (OCLC) Newsletter (2000)
2. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1 (2000)
3. Gardner, M., Bieker, J.: Data mining solves tough semiconductor manufacturing problems. In: Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA (2000)
4. Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., Wiltshire, J.: Discovery of aggregate usage profiles for web personalization. In: Proc. Workshop on Web Mining for E-Commerce - Challenges and Opportunities, Boston, MA, USA (2000)
5. Sun, A., Lim, E.P., Ng, W.K.: Personalized classification for keyword-based category profiles. In: Proc. 6th European Conf. on Research and Advanced Technology for Digital Libraries, Rome, Italy (2002)
6. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability (1967)
7. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon (1996)
8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. ACM SIGMOD Conf., Seattle, WA (1998)
9. Sheikholeslami, G., Chatterjee, S., Zhang, A.: WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB Journal 8 (2000)
10. Foss, A., Zaïane, O.R.: A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proc. Int. Conf. on Data Mining, Maebashi City, Japan (2002)
11. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proc. 24th Int. Conf. on Very Large Data Bases (1998)
12. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000)
13. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. on Very Large Databases, Santiago, Chile (1994)
14. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000)
15. Das, A., Ng, W.K., Woon, Y.K.: Rapid association rule mining. In: Proc. 10th Int. Conf. on Information and Knowledge Management, Atlanta, Georgia (2001)
16. Woon, Y.K., Ng, W.K., Das, A.: Fast online dynamic association rule mining. In: Proc. 2nd Int. Conf. on Web Information Systems Engineering, Kyoto, Japan (2001)
17. Woon, Y.K., Ng, W.K., Lim, E.P.: Preprocessing optimization structures for association rule mining. Technical Report CAIS-TR-02-48, School of Computer Engineering, Nanyang Technological University, Singapore (2002)
18. Karypis, G., Han, E.H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic modeling. Computer 32 (1999) 68-75


FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING

FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING Neha V. Sonparote, Professor Vijay B. More. Neha V. Sonparote, Dept. of computer Engineering, MET s Institute of Engineering Nashik, Maharashtra,

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

Association Rule Mining. Introduction 46. Study core 46

Association Rule Mining. Introduction 46. Study core 46 Learning Unit 7 Association Rule Mining Introduction 46 Study core 46 1 Association Rule Mining: Motivation and Main Concepts 46 2 Apriori Algorithm 47 3 FP-Growth Algorithm 47 4 Assignment Bundle: Frequent

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Unsupervised learning on Color Images

Unsupervised learning on Color Images Unsupervised learning on Color Images Sindhuja Vakkalagadda 1, Prasanthi Dhavala 2 1 Computer Science and Systems Engineering, Andhra University, AP, India 2 Computer Science and Systems Engineering, Andhra

More information

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University Chapter MINING HIGH-DIMENSIONAL DATA Wei Wang 1 and Jiong Yang 2 1. Department of Computer Science, University of North Carolina at Chapel Hill 2. Department of Electronic Engineering and Computer Science,

More information

Text clustering based on a divide and merge strategy

Text clustering based on a divide and merge strategy Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

More information

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery : Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,

More information

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels 15 International Workshop on Data Mining with Industrial Applications K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels Madhuri Debnath Department of Computer Science and Engineering

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

C-NBC: Neighborhood-Based Clustering with Constraints

C-NBC: Neighborhood-Based Clustering with Constraints C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Yaochun Huang, Hui Xiong, Weili Wu, and Sam Y. Sung 3 Computer Science Department, University of Texas - Dallas, USA, {yxh03800,wxw0000}@utdallas.edu

More information

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011, Weighted Association Rule Mining Without Pre-assigned Weights PURNA PRASAD MUTYALA, KUMAR VASANTHA Department of CSE, Avanthi Institute of Engg & Tech, Tamaram, Visakhapatnam, A.P., India. Abstract Association

More information

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset M.Hamsathvani 1, D.Rajeswari 2 M.E, R.Kalaiselvi 3 1 PG Scholar(M.E), Angel College of Engineering and Technology, Tiruppur,

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA SEQUENTIAL PATTERN MINING FROM WEB LOG DATA Rajashree Shettar 1 1 Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in Abstract

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

A Parallel Community Detection Algorithm for Big Social Networks

A Parallel Community Detection Algorithm for Big Social Networks A Parallel Community Detection Algorithm for Big Social Networks Yathrib AlQahtani College of Computer and Information Sciences King Saud University Collage of Computing and Informatics Saudi Electronic

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Normalization based K means Clustering Algorithm

Normalization based K means Clustering Algorithm Normalization based K means Clustering Algorithm Deepali Virmani 1,Shweta Taneja 2,Geetika Malhotra 3 1 Department of Computer Science,Bhagwan Parshuram Institute of Technology,New Delhi Email:deepalivirmani@gmail.com

More information

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008 121 An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

More information

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction International Journal of Engineering Science Invention Volume 2 Issue 1 January. 2013 An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction Janakiramaiah Bonam 1, Dr.RamaMohan

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

Item Set Extraction of Mining Association Rule

Item Set Extraction of Mining Association Rule Item Set Extraction of Mining Association Rule Shabana Yasmeen, Prof. P.Pradeep Kumar, A.Ranjith Kumar Department CSE, Vivekananda Institute of Technology and Science, Karimnagar, A.P, India Abstract:

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Sheetal K. Labade Computer Engineering Dept., JSCOE, Hadapsar Pune, India Srinivasa Narasimha

More information

GRID BASED CLUSTERING

GRID BASED CLUSTERING Cluster Analysis Grid Based Clustering STING CLIQUE 1 GRID BASED CLUSTERING Uses a grid data structure Quantizes space into a finite number of cells that form a grid structure Several interesting methods

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Efficient Mining of Generalized Negative Association Rules

Efficient Mining of Generalized Negative Association Rules 2010 IEEE International Conference on Granular Computing Efficient Mining of Generalized egative Association Rules Li-Min Tsai, Shu-Jing Lin, and Don-Lin Yang Dept. of Information Engineering and Computer

More information

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES 7.1. Abstract Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

Data Structure for Association Rule Mining: T-Trees and P-Trees

Data Structure for Association Rule Mining: T-Trees and P-Trees IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 6, JUNE 2004 1 Data Structure for Association Rule Mining: T-Trees and P-Trees Frans Coenen, Paul Leng, and Shakil Ahmed Abstract Two new

More information

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

Survey: Efficent tree based structure for mining frequent pattern from transactional databases IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 9, Issue 5 (Mar. - Apr. 2013), PP 75-81 Survey: Efficent tree based structure for mining frequent pattern from

More information

Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold

Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold Zengyou He, Xiaofei Xu, Shengchun Deng Department of Computer Science and Engineering, Harbin Institute of Technology,

More information

Heterogeneous Density Based Spatial Clustering of Application with Noise

Heterogeneous Density Based Spatial Clustering of Application with Noise 210 Heterogeneous Density Based Spatial Clustering of Application with Noise J. Hencil Peter and A.Antonysamy, Research Scholar St. Xavier s College, Palayamkottai Tamil Nadu, India Principal St. Xavier

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Clustering: Model, Grid, and Constraintbased Methods Reading: Chapters 10.5, 11.1 Han, Chapter 9.2 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han,

More information

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK

More information

CS412 Homework #3 Answer Set

CS412 Homework #3 Answer Set CS41 Homework #3 Answer Set December 1, 006 Q1. (6 points) (1) (3 points) Suppose that a transaction datase DB is partitioned into DB 1,..., DB p. The outline of a distributed algorithm is as follows.

More information

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu,

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu, Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong fadafu, wwkwongg@cse.cuhk.edu.hk

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India Volume 115 No. 7 2017, 105-110 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN Balaji.N 1,

More information

CSCI6405 Project - Association rules mining

CSCI6405 Project - Association rules mining CSCI6405 Project - Association rules mining Xuehai Wang xwang@ca.dalc.ca B00182688 Xiaobo Chen xiaobo@ca.dal.ca B00123238 December 7, 2003 Chen Shen cshen@cs.dal.ca B00188996 Contents 1 Introduction: 2

More information

MOSAIC: A Proximity Graph Approach for Agglomerative Clustering 1

MOSAIC: A Proximity Graph Approach for Agglomerative Clustering 1 MOSAIC: A Proximity Graph Approach for Agglomerative Clustering Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-sheng Chen, Oner Ulvi Celepcikay, Christian Giusti, and Christoph F. Eick Computer Science Department,

More information

Association Rules Mining using BOINC based Enterprise Desktop Grid

Association Rules Mining using BOINC based Enterprise Desktop Grid Association Rules Mining using BOINC based Enterprise Desktop Grid Evgeny Ivashko and Alexander Golovin Institute of Applied Mathematical Research, Karelian Research Centre of Russian Academy of Sciences,

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

K-Means Clustering With Initial Centroids Based On Difference Operator

K-Means Clustering With Initial Centroids Based On Difference Operator K-Means Clustering With Initial Centroids Based On Difference Operator Satish Chaurasiya 1, Dr.Ratish Agrawal 2 M.Tech Student, School of Information and Technology, R.G.P.V, Bhopal, India Assistant Professor,

More information

THE STUDY OF WEB MINING - A SURVEY

THE STUDY OF WEB MINING - A SURVEY THE STUDY OF WEB MINING - A SURVEY Ashish Gupta, Anil Khandekar Abstract over the year s web mining is the very fast growing research field. Web mining contains two research areas: Data mining and World

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

Memory issues in frequent itemset mining

Memory issues in frequent itemset mining Memory issues in frequent itemset mining Bart Goethals HIIT Basic Research Unit Department of Computer Science P.O. Box 26, Teollisuuskatu 2 FIN-00014 University of Helsinki, Finland bart.goethals@cs.helsinki.fi

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Maintenance of the Prelarge Trees for Record Deletion

Maintenance of the Prelarge Trees for Record Deletion 12th WSEAS Int. Conf. on APPLIED MATHEMATICS, Cairo, Egypt, December 29-31, 2007 105 Maintenance of the Prelarge Trees for Record Deletion Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu Department of

More information

Materialized Data Mining Views *

Materialized Data Mining Views * Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61

More information

Mining Quantitative Association Rules on Overlapped Intervals

Mining Quantitative Association Rules on Overlapped Intervals Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery Ninh D. Pham, Quang Loc Le, Tran Khanh Dang Faculty of Computer Science and Engineering, HCM University of Technology,

More information

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm Marek Wojciechowski, Krzysztof Galecki, Krzysztof Gawronek Poznan University of Technology Institute of Computing Science ul.

More information