Parameterless Data Compression and Noise Filtering Using Association Rule Mining


Yew-Kwong Woon 1, Xiang Li 2, Wee-Keong Ng 1, and Wen-Feng Lu 2,3

1 Nanyang Technological University, Nanyang Avenue, Singapore, SINGAPORE
2 Singapore Institute of Manufacturing Technology, 71 Nanyang Drive, Singapore, SINGAPORE
3 Singapore-MIT Alliance

Abstract. The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based, grid-based or resolution-based methods to handle noise, but they require the fine-tuning of complex parameters. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. There are several noise/outlier detection techniques, but they too need suitable parameters. In this paper, we present a novel parameterless method of filtering noise using ideas borrowed from association rule mining. We term our technique FLUID (Filtering Using Itemset Discovery). FLUID automatically discovers representative points in the dataset without any input parameter by mapping the dataset into a form suitable for frequent itemset discovery. After frequent itemsets are discovered, they are mapped back to their original form and become representative points of the original dataset. As such, FLUID accomplishes both data and noise reduction simultaneously, making it an ideal preprocessing step for cluster analysis. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID.
1 Introduction

The information age was hastily ushered in by the birth of the World Wide Web (Web). All of a sudden, an abundance of information, in the form of web pages and digital libraries, was available at the fingertips of anyone connected to the Web. Researchers from the Online Computer Library Center found that there were 7 million unique sites in the year 2000, and the Web was predicted to continue its fast expansion [1]. Data mining has become important because traditional statistical techniques are no longer able to handle such immense data. Cluster analysis, or clustering, has become the data mining technique of choice because of its ability to function with little human supervision. Clustering is the process of grouping a set of physical/abstract objects

into classes of similar objects. It has been found useful for a wide variety of applications such as web usage mining [2], manufacturing [3], personalization of web pages [4] and digital libraries [5]. Researchers have begun to analyze traditional clustering techniques in an attempt to adapt them to current needs. One such technique is the classic k-means algorithm [6]. It is fast but very sensitive to the parameter k and to noise. Recent clustering techniques that attempt to handle noise more effectively include density-based techniques [7], grid-based techniques [8] and resolution-based techniques [9, 10]. However, all of them require the fine-tuning of complex parameters to remove the adverse effects of noise. Empirical studies show that many adjustments need to be made and an optimal solution is not always guaranteed [10]. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. Since most data, such as digital library documents, web logs and manufacturing specifications, have many features or dimensions, this shortcoming is unacceptable. There are also several works on outlier/noise detection, but they too require the setting of non-intuitive parameters [11, 12]. In this paper, we present a novel unsupervised method of filtering noise using ideas borrowed from association rule mining (ARM) [13]. We term our technique FLUID (FiLtering Using Itemset Discovery). FLUID first maps the dataset into a set of items using binning. Next, ARM is applied to discover frequent itemsets. As there has been sustained intense interest in ARM since its conception in 1993, ARM algorithms have improved by leaps and bounds. Any ARM algorithm can be used by FLUID, which allows it to leverage the efficiency of the latest ARM methods. After frequent itemsets are found, they are mapped back to become representative points of the original dataset.
This capability of FLUID not only eliminates the problematic need for noise removal in existing clustering algorithms but also improves their efficiency and scalability because the size of the dataset is significantly reduced. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID. The rest of the paper is organized as follows. The next section reviews related work in the areas of clustering, outlier detection and ARM, while Section 3 presents the FLUID algorithm. Experiments are conducted to assess the feasibility of FLUID in Section 4. Finally, the paper is concluded in Section 5.

2 Related Work

In this section, we review prominent works in the areas of clustering and outlier detection. The problem of ARM and its representative algorithms are discussed as well.

2.1 Clustering and Outlier Detection

The k-means algorithm is the pioneering algorithm in clustering [6]. It begins by randomly generating k cluster centers known as centroids. Objects are iteratively

assigned to the cluster whose centroid is nearest. It is fast but sensitive to the parameter k and to noise. Density-based methods are more noise-resistant and are based on the notion that dense regions are interesting regions. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the pioneering density-based technique [7]. It uses two input parameters to define what constitutes the neighborhood of an object and whether its neighborhood is dense enough to be considered. Grid-based techniques can also handle noise. They partition the search space into a number of cells/units and perform clustering on such units. CLIQUE (CLustering In QUEst) considers a unit to be dense if the number of objects in it exceeds a density threshold and uses an apriori-like technique to iteratively derive higher-dimensional dense units [8]. CLIQUE requires the user to specify a density threshold and the size of grids. Recently, resolution-based techniques have been proposed and applied successfully to noisy datasets. The basic idea is that when viewed at different resolutions, the dataset reveals different clusters, and by visualization or change detection of certain statistics, the correct resolution at which noise is minimal can be chosen. WaveCluster is a resolution-based algorithm that uses wavelet transformation to distinguish clusters from noise [9]. Users must first determine the best quantization scheme for the dataset and then decide on the number of times to apply the wavelet transform. The TURN* algorithm is another recent resolution-based algorithm [10]. It iteratively scales the data to various resolutions. To determine the ideal resolution, it uses the third differential of the series of cluster feature statistics to detect an abrupt change in the trend. However, it is unclear how certain parameters such as the closeness threshold and the step size of resolution scaling are chosen.
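To make the parameter burden of density-based methods concrete, the following minimal Python sketch shows DBSCAN-style core-point testing (this is only the neighborhood-density test, not the full algorithm; the function and parameter names are ours, with eps and min_pts standing for the two input parameters described above):

```python
def dense_enough(points, p, eps, min_pts):
    """DBSCAN-style core-point test: the neighborhood of p is dense
    enough if at least min_pts points lie within distance eps of p
    (p itself included, as in the usual formulation)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(dist(p, q) <= eps for q in points) >= min_pts
```

Choosing eps and min_pts well requires prior knowledge of the data's density distribution, which is exactly the tuning burden FLUID aims to avoid.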
Outlier detection is another means of tackling noise. One classic notion is that of DB (Distance-Based) outliers [11]. An object is considered a DB-outlier if a certain fraction f of the dataset lies at a distance greater than D from it. A recent enhancement involves the concept of k-nearest neighbors [12]: the top n points with the largest D_k (the distance to a point's k-th nearest neighbor) are treated as outliers. The parameters f, D, k and n must be supplied by the user. In summary, there is currently no ideal solution to the problem of noise: existing clustering algorithms require much parameter tweaking, which becomes difficult for high-dimensional datasets. Even if their parameters can somehow be optimally set for a particular dataset, there is no guarantee that the same settings will work for other datasets. The problem is similar in the area of outlier detection.

2.2 Association Rule Mining

Since the concept of ARM is central to FLUID, we formally define ARM and then survey existing ARM algorithms in this section. A formal description of ARM is as follows. Let the universal itemset I = {a_1, a_2, ..., a_U} be a set of literals called items. Let D_t be a database of transactions, where each transaction T contains a set of items such that T ⊆ I. A j-itemset is a set of j unique items.

For a given itemset X ⊆ I and a given transaction T, T contains X if and only if X ⊆ T. Let ψ_X be the support count of an itemset X, which is the number of transactions in D_t that contain X. Let s be the support threshold and |D_t| be the number of transactions in D_t. An itemset X is frequent if ψ_X ≥ |D_t| × s%. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in D_t with confidence c% if no less than c% of the transactions in D_t that contain X also contain Y. The association rule X ⇒ Y has support s% in D_t if ψ_{X∪Y} ≥ |D_t| × s%. The problem of mining association rules is to discover rules that have confidence and support greater than the thresholds. It consists of two main tasks: the discovery of frequent itemsets and the generation of association rules from frequent itemsets. Researchers usually tackle only the first task because it is more computationally expensive. Hence, current algorithms are designed to efficiently discover frequent itemsets. We will leverage the ability of ARM algorithms to rapidly discover frequent itemsets in FLUID. Introduced in 1994, the Apriori algorithm was the first successful algorithm for mining association rules [13]. Since its introduction, it has popularized ARM. It introduces a method to generate candidate itemsets in a pass using only the frequent itemsets from the previous pass. The idea, known as the apriori property, rests on the fact that any subset of a frequent itemset must be frequent as well. The FP-growth (Frequent Pattern-growth) algorithm is a more recent ARM algorithm that achieves impressive results by removing the need to generate candidate itemsets, which is the main bottleneck in Apriori [14]. It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent itemsets. This compact structure also removes the need for multiple database scans: it is constructed using only two scans.
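The definitions above can be illustrated with a toy levelwise miner in Python (a didactic sketch, not the optimized Apriori of [13]; here s is given as a fraction rather than a percentage). It counts supports and uses the apriori property to prune candidates:

```python
from itertools import combinations

def frequent_itemsets(transactions, s):
    """Return {itemset: support_count} for every itemset X with
    psi_X >= |D_t| * s, mined levelwise with apriori pruning."""
    n = len(transactions)
    min_count = n * s
    transactions = [frozenset(t) for t in transactions]
    # support counts of 1-itemsets
    counts = {}
    for t in transactions:
        for a in t:
            key = frozenset([a])
            counts[key] = counts.get(key, 0) + 1
    frequent = {X: c for X, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # candidate k-itemsets: extend frequent (k-1)-itemsets; the apriori
        # property lets us require that every (k-1)-subset is frequent
        candidates = set()
        items = sorted({a for X in frequent for a in X})
        for X in frequent:
            for a in items:
                if a not in X:
                    c = X | {a}
                    if all(frozenset(sub) in frequent
                           for sub in combinations(c, k - 1)):
                        candidates.add(frozenset(c))
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {X: cnt for X, cnt in counts.items() if cnt >= min_count}
        result.update(frequent)
        k += 1
    return result
```

For example, with five transactions over items {a, b, c} and s = 0.4, an itemset is frequent when it appears in at least two transactions.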
The items in the transactions are first sorted and then used to construct the FP-tree. Next, FP-growth proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets. Recently, we presented a novel trie-based data structure known as the Support-Ordered Trie ITemset (SOTrieIT) to store support counts of 1-itemsets and 2-itemsets [15, 16]. The SOTrieIT is designed to be used efficiently by our algorithm FOLDARM (Fast OnLine Dynamic Association Rule Mining) [16]. In our recent work on ARM, we propose a new algorithm, FOLD-growth (Fast OnLine Dynamic-growth), which is an optimized hybrid version of FOLDARM and FP-growth [17]. FOLD-growth first builds a set of SOTrieITs from the database and uses them to prune the database before building FP-trees. FOLD-growth is shown to outperform FP-growth by up to two orders of magnitude.

3 Filtering Using Itemset Discovery (FLUID)

3.1 Algorithm

Given a d-dimensional dataset D_o consisting of n objects o_1, o_2, ..., o_n, FLUID discovers a set of representative objects O_1, O_2, ..., O_m, where m ≪ n, in three main steps:

1. Convert dataset D_o into a transactional database D_t using procedure MapDB
2. Mine D_t for frequent itemsets using procedure MineDB
3. Convert the discovered frequent itemsets back to their original object form using procedure MapItemset

Procedure MapDB
1. Sort each dimension of D_o in ascending order
2. Compute the mean µ_x and standard deviation σ_x of the nearest-object distance in each dimension x by checking the left and right neighbors of each object
3. Find the range of values r_x for each dimension x
4. Compute the number of bins β_x for each dimension x: β_x = r_x / ((µ_x + 3σ_x) n)
5. Map each bin to a unique item a ∈ I
6. Convert each object o_i in D_o into a transaction T_i with exactly d items by binning its feature values, yielding a transactional database D_t

Procedure MapDB tries to discretize the features of dataset D_o in a way that minimizes the number of required bins without losing the pertinent structural information of D_o. Every dimension has its own distribution of values and thus it is necessary to compute the bin sizes of each dimension/feature separately. Discretization is itself a massive area, but experiments reveal that MapDB is good enough to remove noise efficiently and effectively. To understand the data distribution in each dimension, the mean and standard deviation of the closest-neighbor distance of every object in every dimension are computed. Assuming that each dimension follows a Normal distribution, an object should have one neighboring object within three standard deviations of the mean nearest-neighbor distance. To avoid having too many bins, there is a need to ensure that each bin contains a certain number of objects (0.5% of the dataset size), and this is accomplished in step 4. In the event that the values are spread out too widely, i.e. the standard deviation is much larger than the mean, the number of standard deviations used in step 4 is reduced from 3 to 1.
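A minimal Python sketch of procedure MapDB follows. It is only one possible interpretation of the steps above (we take µ_x + 3σ_x as the bin width in dimension x, assume n ≥ 2, and the function name and item-numbering scheme are ours):

```python
def map_db(data):
    """Discretize a list of d-dimensional points into transactions.
    Each (dimension, bin) pair gets a globally unique item id, so every
    object becomes a transaction with exactly d items."""
    n, d = len(data), len(data[0])
    transactions = [set() for _ in range(n)]
    offset = 0  # running base for item ids, one block per dimension
    for x in range(d):
        vals = sorted(p[x] for p in data)                       # step 1
        gaps = [b - a for a, b in zip(vals, vals[1:])]
        # nearest-object distance: min of left/right gap (step 2);
        # endpoints only have one neighbor, hence the infinities
        nearest = [min(l, r) for l, r in
                   zip([float('inf')] + gaps, gaps + [float('inf')])]
        mu = sum(nearest) / n
        sigma = (sum((v - mu) ** 2 for v in nearest) / n) ** 0.5
        lo, hi = vals[0], vals[-1]                              # step 3
        width = mu + 3 * sigma                                  # step 4
        n_bins = max(1, int((hi - lo) / width) + 1) if width > 0 else 1
        for i, p in enumerate(data):                            # steps 5-6
            b = min(int((p[x] - lo) / width), n_bins - 1) if width > 0 else 0
            transactions[i].add(offset + b)
        offset += n_bins
    return transactions
```

Objects that fall into the same bin in every dimension yield identical transactions, which is what lets frequent itemset discovery find dense regions later.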
Note that if a particular dimension has fewer than 100 unique values, steps 2-4 are unnecessary and the number of bins is simply the number of unique values. As mentioned in step 6, each object becomes a transaction with exactly d items because each item represents one feature of the object. The transactions do not have duplicate items because every feature has its own unique set of bins. Once D_o is mapped into transactions with unique items, it is in a form that can be mined by any association rule mining algorithm.

Procedure MineDB
1. Set support threshold s = 0.1 (10%)
2. Set the number of required frequent d-itemsets k = n
3. Let δ(A,B) be the distance between two j-itemsets A = (a_1, ..., a_j) and B = (b_1, ..., b_j): δ(A,B) = sqrt(Σ_{i=1..j} (a_i - b_i)^2)
4. An itemset A is a loner itemset if δ(A,Z) > 1 for all Z ∈ L, Z ≠ A
5. Repeat
6.   Repeat
7.     Use an association rule mining algorithm to discover a set of frequent itemsets L from D_t
8.     Remove itemsets with fewer than d items from L
9.     Adjust s using a variable step size to bring |L| closer to k
10.   Until |L| = k or |L| stabilizes
11.   Set k = |L| / 2
12.   Set s = 0.1
13.   Remove loner itemsets from L
14. Until an abrupt change in the number of loner itemsets occurs

MineDB is the most time-consuming and complex step of FLUID. The key idea is to discover the optimal set of frequent itemsets that represents the important characteristics of the original dataset; we consider important characteristics to be dense regions in the original dataset. The support threshold s is akin to the density threshold used by density-based clustering algorithms and can thus be used to remove regions of low density (itemsets with low support counts). The crucial point is how to automate the fine-tuning of s. This is done by checking the number of loner itemsets after each iteration (steps 6-14). Loner itemsets represent points with no neighboring points in the discretized d-dimensional feature space. Therefore, an abrupt change in the number of loner itemsets indicates that the support threshold has been reduced to a point where dense regions in the original dataset are being divided into too many sparse regions. This point is made more evident in Section 4, where its effect can be visually observed. The number of desired frequent d-itemsets (frequent itemsets with exactly d items), k, is initially set to the size of the original dataset, as seen in step 2. The goal is to obtain the finest resolution of the dataset that is attainable after its transformation.
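Steps 3-4 of MineDB can be sketched directly in Python, representing each frequent d-itemset as a d-tuple of bin indices (a hypothetical helper of ours, not the authors' code):

```python
def delta(A, B):
    """Euclidean distance between two j-itemsets of bin indices (step 3)."""
    return sum((a - b) ** 2 for a, b in zip(A, B)) ** 0.5

def count_loners(L):
    """Count loner itemsets: those with no neighbor within distance 1
    of themselves in L (step 4)."""
    return sum(all(delta(A, Z) > 1 for Z in L if Z != A) for A in L)
```

Tracking count_loners(L) across iterations gives the statistic whose abrupt change terminates the outer loop.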
The algorithm then proceeds to derive coarser resolutions in an exponential fashion in order to quickly discover a good representation of the original dataset. This is done at step 11, where k is reduced to half of |L|. The amount of reduction can certainly be lowered to obtain more resolutions, but this would incur longer processing time and may not be necessary. Experiments have revealed that our choice suffices for a good approximation of the representative points of a dataset. In step 8, notice that itemsets with fewer than d items are removed. This is because association rule mining discovers frequent itemsets of various sizes, but we are only interested in frequent itemsets containing items that represent all the features of the dataset. In step 9, the support threshold s is incremented or decremented by a variable step size. The step size is variable as it must

be made smaller in order to zoom in on the best possible s for obtaining the required number of frequent d-itemsets, k. In most situations, it is quite unlikely that |L| can be adjusted to equal k exactly; thus, if |L| stabilizes or fluctuates between similar values, its closest approximation to k is considered the best solution, as seen in step 10.

Procedure MapItemset
1. for each frequent itemset A ∈ L do
2.   for each item i ∈ A do
3.     Assign the center of the bin represented by i as its new value
4.   end for
5. end for

The final step of FLUID is the simplest: it maps the frequent itemsets back to their original object form. The filtered dataset now contains representative points of the original dataset, excluding most of the noise. Note that the filtering is only an approximation, but it is sufficient to remove most of the noise in the data and retain the pertinent structural characteristics of the data. Subsequent data mining tasks such as clustering can then extract knowledge from the filtered and compressed dataset efficiently, with little complication from noise. Note also that the types of clusters discovered depend mainly on the clustering algorithm used and not on FLUID.

3.2 Complexity Analysis

The time complexities of the three main steps of FLUID are as follows:
1. MapDB: The main processing time is taken by the sorting in step 1; hence its time complexity is O(n log n).
2. MineDB: As the total number of iterations used by the loops in the procedure is very small, the bulk of the processing time is attributed to the time to perform association rule mining, T_A.
3. MapItemset: The processing time depends on the number of resultant representative points |L|; thus it has a time complexity of O(n).
Hence, the overall time complexity of FLUID is O(n log n + T_A + n).

3.3 Strengths and Weaknesses

The main strength of FLUID is its independence from user-supplied parameters. Unlike its predecessors, FLUID does not require any human supervision.
Not only does it remove noise/outliers, it also compresses the dataset into a set of representative points without any loss of pertinent structural information from the original dataset. In addition, it is reasonably scalable with respect to both the size and

dimensionality of the dataset, as it inherits the efficient characteristics of existing association rule mining algorithms. Hence, it is an attractive preprocessing tool for clustering and other data mining tasks. Ironically, its weakness also stems from its use of association rule mining techniques: association rule mining algorithms do not scale as well as resolution-based algorithms in terms of dataset dimensionality. Fortunately, since ARM is still receiving much attention from the research community, more efficient ARM algorithms are likely to become available to FLUID. Another weakness is that FLUID spends much redundant processing time finding and storing frequent itemsets that have fewer than d items. This problem is inherent in association rule mining because larger frequent itemsets are usually formed from smaller frequent itemsets. Efficiency and scalability could be improved greatly if there were a way to directly discover frequent d-itemsets.

4 Experiments

This section evaluates the viability of FLUID through experiments conducted on a Pentium-4 machine with a CPU clock rate of 2 GHz and 1 GB of main memory. We use FOLD-growth as our ARM algorithm as it is fast, incremental and scalable [17]. All algorithms are implemented in Java.

Fig. 1. (a)-(d): Results of executing FLUID on a synthetic dataset.

The synthetic dataset used here (t7.10k.dat) tests the ability of FLUID to discover clusters of various sizes and shapes amidst much noise; it has been used as a benchmarking test for several clustering algorithms [10]. It has been shown that prominent algorithms like k-means [6], DBSCAN [7], CHAMELEON [18] and WaveCluster [9] are unable to properly find the nine visually obvious clusters and remove noise even with exhaustive parameter adjustments [10]. Only TURN* [10] manages to find the correct clusters, but it requires user-supplied parameters, as mentioned in Section 2.1. Figure 1(a) shows the dataset with 10,000 points in nine arbitrarily-shaped clusters interspersed with random noise. Figure 1 shows the results of running FLUID on the dataset. FLUID stops at the iteration where Figure 1(c) is obtained, but we show the remaining results to illustrate the effect of loner itemsets. It is clear that Figure 1(c) is the optimal result, as most of the noise is removed while the nine clusters remain intact. Figure 1(d) loses much of the pertinent information of the dataset. The numbers of loner itemsets for Figures 1(b), (c) and (d) are 155, 55 and 136 respectively. Figure 1(b) has the most loner itemsets because of the presence of noise in the original dataset. It is the finest representation of the dataset in terms of resolution. There is a sharp drop in the number of loner itemsets in Figure 1(c), followed by a sharp increase in Figure 1(d). The sharp drop is explained by the fact that most noise is removed, leaving behind objects that are closely grouped together. In contrast, the sharp increase in loner itemsets is caused by too low a support threshold. This means that only very dense regions are captured, which causes the disintegration of the nine clusters as seen in Figure 1(d). Hence, a change in the trend of the number of loner itemsets indicates that the structural characteristics of the dataset have changed.
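One simple way to read the drop-then-rise pattern in the loner counts (155, 55, 136 above) is as the first local minimum of the sequence; the sketch below is our interpretation of the "abrupt change" stopping test, not the authors' exact criterion:

```python
def best_iteration(loner_counts):
    """Return the index of the first local minimum in the sequence of
    loner-itemset counts across iterations. The drop marks noise being
    removed; the subsequent rise marks clusters disintegrating, so the
    minimum is the resolution to keep. Returns None if no reversal."""
    for i in range(1, len(loner_counts) - 1):
        if loner_counts[i - 1] > loner_counts[i] < loner_counts[i + 1]:
            return i
    return None
```

On the counts reported for Figures 1(b)-(d), this picks the iteration corresponding to Figure 1(c).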
FLUID took a mere 6 s to compress the dataset into 1,650 representative points with much of the noise removed. The dataset is reduced by more than 80% without affecting its inherent structure; that is, the shapes of its nine clusters are retained. This experiment therefore shows that FLUID can filter away noise, even in a noisy dataset with sophisticated clusters, without any user parameters and with impressive efficiency.

5 Conclusions

Clustering is an important data mining task, especially in our information age where raw data is abundant. Several existing clustering methods cannot handle noise effectively because they require the user to set complex parameters properly. We propose FLUID, a noise-filtering and parameterless algorithm based on association rule mining, to overcome the problem of noise as well as to compress the dataset. Experiments on a benchmarking synthetic dataset show the effectiveness of our approach. In future work, we will improve our approach, provide rigorous proofs for it, and design a clustering algorithm that integrates efficiently with FLUID. In addition, the problem of handling high-dimensional datasets will be addressed. Finally, more experiments involving larger datasets with more dimensions will be conducted to affirm the practicality of FLUID.

References

1. Dean, N., ed.: OCLC Researchers Measure the World Wide Web. Number 248, Online Computer Library Center (OCLC) Newsletter (2000)
2. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1 (2000)
3. Gardner, M., Bieker, J.: Data mining solves tough semiconductor manufacturing problems. In: Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA (2000)
4. Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., Wiltshire, J.: Discovery of aggregate usage profiles for web personalization. In: Proc. Workshop on Web Mining for E-Commerce - Challenges and Opportunities, Boston, MA, USA (2000)
5. Sun, A., Lim, E.P., Ng, W.K.: Personalized classification for keyword-based category profiles. In: Proc. 6th European Conf. on Research and Advanced Technology for Digital Libraries, Rome, Italy (2002)
6. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability (1967)
7. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon (1996)
8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. ACM SIGMOD Conf., Seattle, WA (1998)
9. Sheikholeslami, G., Chatterjee, S., Zhang, A.: WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB Journal 8 (2000)
10. Foss, A., Zaïane, O.R.: A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proc. Int. Conf. on Data Mining, Maebashi City, Japan (2002)
11. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proc. 24th Int. Conf. on Very Large Data Bases (1998)
12. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000)
13. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. on Very Large Databases, Santiago, Chile (1994)
14. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000)
15. Das, A., Ng, W.K., Woon, Y.K.: Rapid association rule mining. In: Proc. 10th Int. Conf. on Information and Knowledge Management, Atlanta, Georgia (2001)
16. Woon, Y.K., Ng, W.K., Das, A.: Fast online dynamic association rule mining. In: Proc. 2nd Int. Conf. on Web Information Systems Engineering, Kyoto, Japan (2001)
17. Woon, Y.K., Ng, W.K., Lim, E.P.: Preprocessing optimization structures for association rule mining. Technical Report CAIS-TR-02-48, School of Computer Engineering, Nanyang Technological University, Singapore (2002)
18. Karypis, G., Han, E.H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic modeling. Computer 32 (1999) 68-75


FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING

FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING Neha V. Sonparote, Professor Vijay B. More. Neha V. Sonparote, Dept. of computer Engineering, MET s Institute of Engineering Nashik, Maharashtra,

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

Association Rule Mining. Introduction 46. Study core 46

Association Rule Mining. Introduction 46. Study core 46 Learning Unit 7 Association Rule Mining Introduction 46 Study core 46 1 Association Rule Mining: Motivation and Main Concepts 46 2 Apriori Algorithm 47 3 FP-Growth Algorithm 47 4 Assignment Bundle: Frequent

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Unsupervised learning on Color Images

Unsupervised learning on Color Images Unsupervised learning on Color Images Sindhuja Vakkalagadda 1, Prasanthi Dhavala 2 1 Computer Science and Systems Engineering, Andhra University, AP, India 2 Computer Science and Systems Engineering, Andhra

More information

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University Chapter MINING HIGH-DIMENSIONAL DATA Wei Wang 1 and Jiong Yang 2 1. Department of Computer Science, University of North Carolina at Chapel Hill 2. Department of Electronic Engineering and Computer Science,

More information

Text clustering based on a divide and merge strategy

Text clustering based on a divide and merge strategy Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

More information

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery : Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,

More information

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels 15 International Workshop on Data Mining with Industrial Applications K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels Madhuri Debnath Department of Computer Science and Engineering

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

C-NBC: Neighborhood-Based Clustering with Constraints

C-NBC: Neighborhood-Based Clustering with Constraints C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Yaochun Huang, Hui Xiong, Weili Wu, and Sam Y. Sung 3 Computer Science Department, University of Texas - Dallas, USA, {yxh03800,wxw0000}@utdallas.edu

More information

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011, Weighted Association Rule Mining Without Pre-assigned Weights PURNA PRASAD MUTYALA, KUMAR VASANTHA Department of CSE, Avanthi Institute of Engg & Tech, Tamaram, Visakhapatnam, A.P., India. Abstract Association

More information

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset M.Hamsathvani 1, D.Rajeswari 2 M.E, R.Kalaiselvi 3 1 PG Scholar(M.E), Angel College of Engineering and Technology, Tiruppur,

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA SEQUENTIAL PATTERN MINING FROM WEB LOG DATA Rajashree Shettar 1 1 Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in Abstract

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

A Parallel Community Detection Algorithm for Big Social Networks

A Parallel Community Detection Algorithm for Big Social Networks A Parallel Community Detection Algorithm for Big Social Networks Yathrib AlQahtani College of Computer and Information Sciences King Saud University Collage of Computing and Informatics Saudi Electronic

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Normalization based K means Clustering Algorithm

Normalization based K means Clustering Algorithm Normalization based K means Clustering Algorithm Deepali Virmani 1,Shweta Taneja 2,Geetika Malhotra 3 1 Department of Computer Science,Bhagwan Parshuram Institute of Technology,New Delhi Email:deepalivirmani@gmail.com

More information

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008 121 An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

More information

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction International Journal of Engineering Science Invention Volume 2 Issue 1 January. 2013 An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction Janakiramaiah Bonam 1, Dr.RamaMohan

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

Item Set Extraction of Mining Association Rule

Item Set Extraction of Mining Association Rule Item Set Extraction of Mining Association Rule Shabana Yasmeen, Prof. P.Pradeep Kumar, A.Ranjith Kumar Department CSE, Vivekananda Institute of Technology and Science, Karimnagar, A.P, India Abstract:

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets Sheetal K. Labade Computer Engineering Dept., JSCOE, Hadapsar Pune, India Srinivasa Narasimha

More information

GRID BASED CLUSTERING

GRID BASED CLUSTERING Cluster Analysis Grid Based Clustering STING CLIQUE 1 GRID BASED CLUSTERING Uses a grid data structure Quantizes space into a finite number of cells that form a grid structure Several interesting methods

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Efficient Mining of Generalized Negative Association Rules

Efficient Mining of Generalized Negative Association Rules 2010 IEEE International Conference on Granular Computing Efficient Mining of Generalized egative Association Rules Li-Min Tsai, Shu-Jing Lin, and Don-Lin Yang Dept. of Information Engineering and Computer

More information

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES 7.1. Abstract Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

Data Structure for Association Rule Mining: T-Trees and P-Trees

Data Structure for Association Rule Mining: T-Trees and P-Trees IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 6, JUNE 2004 1 Data Structure for Association Rule Mining: T-Trees and P-Trees Frans Coenen, Paul Leng, and Shakil Ahmed Abstract Two new

More information

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

Survey: Efficent tree based structure for mining frequent pattern from transactional databases IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 9, Issue 5 (Mar. - Apr. 2013), PP 75-81 Survey: Efficent tree based structure for mining frequent pattern from

More information

Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold

Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold Zengyou He, Xiaofei Xu, Shengchun Deng Department of Computer Science and Engineering, Harbin Institute of Technology,

More information

Heterogeneous Density Based Spatial Clustering of Application with Noise

Heterogeneous Density Based Spatial Clustering of Application with Noise 210 Heterogeneous Density Based Spatial Clustering of Application with Noise J. Hencil Peter and A.Antonysamy, Research Scholar St. Xavier s College, Palayamkottai Tamil Nadu, India Principal St. Xavier

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Clustering: Model, Grid, and Constraintbased Methods Reading: Chapters 10.5, 11.1 Han, Chapter 9.2 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han,

More information

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK

More information

CS412 Homework #3 Answer Set

CS412 Homework #3 Answer Set CS41 Homework #3 Answer Set December 1, 006 Q1. (6 points) (1) (3 points) Suppose that a transaction datase DB is partitioned into DB 1,..., DB p. The outline of a distributed algorithm is as follows.

More information

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu,

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu, Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong fadafu, wwkwongg@cse.cuhk.edu.hk

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India Volume 115 No. 7 2017, 105-110 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN Balaji.N 1,

More information

CSCI6405 Project - Association rules mining

CSCI6405 Project - Association rules mining CSCI6405 Project - Association rules mining Xuehai Wang xwang@ca.dalc.ca B00182688 Xiaobo Chen xiaobo@ca.dal.ca B00123238 December 7, 2003 Chen Shen cshen@cs.dal.ca B00188996 Contents 1 Introduction: 2

More information

MOSAIC: A Proximity Graph Approach for Agglomerative Clustering 1

MOSAIC: A Proximity Graph Approach for Agglomerative Clustering 1 MOSAIC: A Proximity Graph Approach for Agglomerative Clustering Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-sheng Chen, Oner Ulvi Celepcikay, Christian Giusti, and Christoph F. Eick Computer Science Department,

More information

Association Rules Mining using BOINC based Enterprise Desktop Grid

Association Rules Mining using BOINC based Enterprise Desktop Grid Association Rules Mining using BOINC based Enterprise Desktop Grid Evgeny Ivashko and Alexander Golovin Institute of Applied Mathematical Research, Karelian Research Centre of Russian Academy of Sciences,

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

K-Means Clustering With Initial Centroids Based On Difference Operator

K-Means Clustering With Initial Centroids Based On Difference Operator K-Means Clustering With Initial Centroids Based On Difference Operator Satish Chaurasiya 1, Dr.Ratish Agrawal 2 M.Tech Student, School of Information and Technology, R.G.P.V, Bhopal, India Assistant Professor,

More information

THE STUDY OF WEB MINING - A SURVEY

THE STUDY OF WEB MINING - A SURVEY THE STUDY OF WEB MINING - A SURVEY Ashish Gupta, Anil Khandekar Abstract over the year s web mining is the very fast growing research field. Web mining contains two research areas: Data mining and World

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

Memory issues in frequent itemset mining

Memory issues in frequent itemset mining Memory issues in frequent itemset mining Bart Goethals HIIT Basic Research Unit Department of Computer Science P.O. Box 26, Teollisuuskatu 2 FIN-00014 University of Helsinki, Finland bart.goethals@cs.helsinki.fi

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Maintenance of the Prelarge Trees for Record Deletion

Maintenance of the Prelarge Trees for Record Deletion 12th WSEAS Int. Conf. on APPLIED MATHEMATICS, Cairo, Egypt, December 29-31, 2007 105 Maintenance of the Prelarge Trees for Record Deletion Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu Department of

More information

Materialized Data Mining Views *

Materialized Data Mining Views * Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61

More information

Mining Quantitative Association Rules on Overlapped Intervals

Mining Quantitative Association Rules on Overlapped Intervals Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery Ninh D. Pham, Quang Loc Le, Tran Khanh Dang Faculty of Computer Science and Engineering, HCM University of Technology,

More information

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm Marek Wojciechowski, Krzysztof Galecki, Krzysztof Gawronek Poznan University of Technology Institute of Computing Science ul.

More information