
Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Yaochun Huang(1), Hui Xiong(2), Weili Wu(1), and Sam Y. Sung(3)

(1) Computer Science Department, University of Texas - Dallas, USA, {yxh03800,wxw0000}@utdallas.edu
(2) MSIS Department, Rutgers University, USA, hui@rbs.rutgers.edu
(3) Dept. of Computer Science, South Texas College, USA, sysung@southtexascollege.edu

Abstract. Hyperclique patterns are groups of objects that are strongly related to each other. Indeed, the objects in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another, as measured by the uncentered Pearson's correlation coefficient. Recent literature has provided an approach to discovering hyperclique patterns over data sets with binary attributes. In this paper, we introduce algorithms for mining maximal hyperclique patterns in large data sets containing quantitative attributes. An intuitive and simple solution is to partition quantitative attributes into binary attributes; however, partitioning entails potential information loss. Instead, our approach is based on a normalization scheme and works directly on quantitative attributes. In addition, we adopt the algorithm structures of three popular association pattern mining algorithms and add a critical clique pruning technique. Finally, we compare the performance of these algorithms for finding quantitative maximal hyperclique patterns on several real-world data sets.

1 Introduction

A hyperclique pattern [11, 5] is a new type of association pattern that contains items which are highly affiliated with each other. More specifically, the presence of an item in one transaction strongly implies the presence of every other item that belongs to the same hyperclique pattern. Conceptually, the problem of mining hyperclique patterns in transaction data sets can be viewed as finding approximately all-one sub-matrices of a 0-1 matrix, where each column corresponds to an item and each row corresponds to a transaction.
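This implication strength can be made concrete as the weakest of the single-item association rules {i} -> X \ {i}, which is formalized below as h-confidence. The following is a minimal sketch on toy binary transactions; the data and helper names are ours, purely for illustration:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def hconf(transactions, itemset):
    """Minimum confidence over the rules {i} -> X \\ {i}: equivalently,
    support(X) divided by the largest single-item support in X."""
    s = support(transactions, itemset)
    return s / max(support(transactions, [i]) for i in itemset)

# Toy transactions (illustrative only): 'b' and 'c' co-occur in 1 of 4 rows.
T = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a'}]
print(hconf(T, ['b', 'c']))  # sup(bc)/max(sup(b), sup(c)) = 0.25/0.5 = 0.5
```

Note that adding the ubiquitous item 'a' lowers the value: hconf({a,b,c}) = 0.25/1.0 = 0.25, an instance of the anti-monotone property discussed in Section 2.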
For the rest of this paper, we refer to this problem as the binary hyperclique mining problem. However, in many business and scientific domains, there are data sets which contain quantitative attributes (e.g., income, gene expression level). How to define and efficiently identify hyperclique patterns in data sets with quantitative attributes remains a major challenge in the literature. To this end, the focus of this paper is the quantitative hyperclique pattern mining problem. To the best of our knowledge, there is no previous work on developing algorithms for finding quantitative maximal hyperclique patterns. Our approach for mining quantitative hyperclique patterns is built on top of the normalization scheme of [9]. A side effect of this normalization scheme is that there is no support pruning for single items. To meet this computational challenge, we design a clique pruning method that removes a large number of items which are only weakly related to each other, and thus effectively improves the overall performance of finding quantitative hyperclique patterns. We adopt the structures of three popular association pattern mining algorithms, FP-tree [6], DiffEclat [12], and MAFIA [3], as the bases of our algorithms. The purpose of these algorithms is to find quantitative maximal hyperclique patterns, a more compact representation of quantitative hyperclique patterns that is desirable for many applications, such as pattern preserving clustering [10]. A hyperclique pattern is a maximal hyperclique pattern if no superset of this pattern is a hyperclique pattern. Finally, we briefly present the results of applying our approach to some real-world data sets.

2 Normalization and Quantitative Hyperclique Patterns

Normalization. In this paper, we adopt the normalization method proposed in [9]. For a vector x = <x_1, x_2, ..., x_n>, the normalization turns x into x' = <x_1/||x||, x_2/||x||, ..., x_n/||x||>, where ||x|| = (sum_{k=1}^{n} x_k^2)^{1/2} is the L2 norm. After data normalization, the support of every individual item i is σ_min,L2(i) = x'_1^2 + x'_2^2 + ... + x'_n^2 = 1, and the support of an itemset X is defined as σ_min,L2(X) = sum_{i in T} (min{t(i,j) : j in X})^2, where t(i,j) is the normalized value of item j in transaction i. One advantage of this normalization is that the resulting support is a number between 0 and 1. Such normalization is natural in many domains, e.g., text documents. However, a side effect is that individual items can no longer be pruned using a support threshold, since every single item has a support of 1.

Quantitative Hyperclique Patterns.
A traditional binary hyperclique pattern [11] is a frequent itemset with the additional constraint that every item in the itemset implies the presence of the remaining items at a minimum level of confidence, known as the h-confidence. Specifically, we have the following:

Definition 1. A set of attributes X forms a hyperclique pattern at a particular level of h-confidence, where h-confidence is defined as

    hconf(X) = min_{i in X} { conf({i} -> X \ {i}) } = σ(X) / max_{i in X} {σ({i})},    (1)

where σ is the standard support function [1].

H-confidence, just like standard support, lies in the interval [0,1] and has the anti-monotone property; that is, the h-confidence of an itemset is greater than or equal to that of any of its supersets. Also, hyperclique patterns have the high-affinity property: items in a pattern with a high h-confidence are guaranteed to have a high pairwise similarity as measured by the cosine metric. Additionally, there is an important relationship between the h-confidence of binary hyperclique patterns and the support function σ_min,L2(X). In particular, since σ_min,L2(X) is equivalent to standard support for binary data, we can substitute σ_min,L2(X) for the standard support function σ(X) in Equation (1). It is then interesting to note

that if we normalize all attributes to have an L2 norm of 1, i.e., σ_min,L2(i) = 1 for all items i, then max_{i in X}{σ(i)} = 1, and by Equation (1), hconf(X) = σ_min,L2(X). In a nutshell, finding continuous hyperclique patterns first proceeds by normalizing the attributes to have an L2 norm of 1. Then, for each row, we take the minimum of the specified attributes. Finally, we square each of these values and add them up. The resulting value is the h-confidence and is a lower bound on the pairwise cosine similarity.

3 Algorithm Descriptions

Here, we present the algorithms for mining quantitative maximal hyperclique patterns. Our algorithms are built on top of three state-of-the-art association pattern mining algorithms: FP-tree [6], DiffEclat [12], and MAFIA [3].

Clique Pruning. We design a clique pruning method for eliminating weakly related single items. Specifically, we first compute the h-confidence of all item pairs on the normalized data. For each item, we then identify the maximum h-confidence value among all pairs including this item. Finally, for a user-specified threshold, we prune all items whose maximum h-confidence is less than this threshold.

Algorithm based on FP-Tree. The FP-tree [6] is a compact tree structure which allows frequent patterns to be identified without generating candidate patterns. Here, we adapt the FP-tree algorithm to find quantitative maximal hyperclique patterns. First, we store float values instead of integer values, since the supports of the normalized data are continuous. Second, the support values should be squared before being added to the FP-tree, since they carry an L2 norm. Finally, before adding transactions to the FP-tree, we need to split the squared transactions so that the support of a preceding item is never less than that of its successor.

Algorithm based on MAFIA. MAFIA [3] is a depth-first search algorithm for mining maximal frequent patterns.
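By way of illustration, the overall pipeline (L2 normalization, clique pruning, and a depth-first search for maximal patterns) can be sketched as follows. This is a simplified sketch under our own naming, not the authors' implementation, and it omits the tidset/diffset machinery that makes the real algorithms efficient:

```python
import math

def l2_normalize(col):
    """Scale an attribute column to unit L2 norm, so its support becomes 1."""
    norm = math.sqrt(sum(v * v for v in col))
    return [v / norm for v in col]

def min_l2_support(cols, items):
    """sigma_min,L2: per row, take the minimum normalized value over the
    chosen items, square it, and sum over rows (equals h-confidence here)."""
    rows = range(len(cols[0]))
    return sum(min(cols[j][i] for j in items) ** 2 for i in rows)

def clique_prune(cols, threshold):
    """Keep only items whose best pairwise h-confidence reaches the threshold."""
    n = len(cols)
    return [j for j in range(n)
            if any(min_l2_support(cols, [j, k]) >= threshold
                   for k in range(n) if k != j)]

def maximal_hypercliques(columns, min_hconf):
    """Depth-first enumeration: extend an itemset while its support stays at
    or above the threshold (valid because sigma_min,L2 is anti-monotone),
    and record itemsets contained in no qualifying superset."""
    cols = [l2_normalize(c) for c in columns]
    candidates = clique_prune(cols, min_hconf)
    results = []

    def dfs(items, start):
        extended = False
        for pos in range(start, len(candidates)):
            j = candidates[pos]
            if min_l2_support(cols, items + [j]) >= min_hconf:
                dfs(items + [j], pos + 1)
                extended = True
        if items and not extended:
            # a superset may have been found in an earlier branch
            if not any(set(items) < set(r) for r in results):
                results.append(items)

    dfs([], 0)
    return results
```

For instance, with two identical attribute columns and one orthogonal column, maximal_hypercliques([[1, 1, 0], [1, 1, 0], [0, 0, 1]], 0.9) groups the first two items into a single maximal pattern, while the third item is eliminated outright by clique pruning.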
For data sets with continuous attributes, we change the algorithm to store not only the tidset, but also the support (normalized data) for each transaction. For this purpose, the algorithm needs a float vector to store the support information; each element of the vector represents the support of the corresponding transaction, in order.

Algorithm based on DiffEclat. DiffEclat uses a vertical data representation, called the diffset, for efficiently mining maximal frequent patterns [12]. The diffset stores only the set difference between the transaction IDs of a pattern and those of its parent pattern. The key modification we made is to store both the transaction IDs and the support information. However, for the diffset, we store the support difference between a pattern and its parent pattern instead of the support itself.

4 Experimental Evaluation

Experimental Setup. Our experiments were performed on two real-life gene expression data sets, Colon Cancer and NCI [2, 7]. Table 1 shows some characteristics of these gene expression data sets.

Table 1. The Characteristics of Gene Expression Data Sets

  DATASET        # GENE   # SAMPLE   # CLASS
  Colon Cancer   2000     62         2
  NCI            9905     68         9

  Colon Cancer classes: C1 Tumor, 40 samples; C2 Normal, 22 samples.

A Performance Comparison. Figure 1(a) shows the running time of the three algorithms on the Colon Cancer data set. As can be seen, when the h-confidence threshold is less than 0.35, FP-tree can be an order of magnitude faster than MAFIA, while DiffEclat is not very efficient and becomes unscalable when the h-confidence threshold is low. Figure 1(b) shows the performance of the proposed algorithms for mining sample patterns on the NCI data set. Similar to the observation from the Colon Cancer data set, when the h-confidence threshold is low, FP-tree can be an order of magnitude faster than MAFIA; however, MAFIA performs better when the h-confidence threshold is high. Another observation is that DiffEclat does not scale when the h-confidence threshold is low.

Fig. 1. The Performance Comparison on Colon and NCI Data Sets (running time in seconds, log scale, versus h-confidence threshold for MAFIA, DiffEclat, and FP-tree; figure omitted).

Fig. 2. The Effect of Clique Pruning on Colon Cancer and NCI Data Sets (processing time in seconds, log scale, versus h-confidence threshold for pruning ratios 0, 0.05, and 0.30; figure omitted).

The Effect of Clique Pruning. Figure 2 demonstrates the effect of clique pruning on the Colon and NCI data sets, using the algorithm based on MAFIA. As

can be seen from both figures, as the clique pruning ratio increases, the running time is reduced significantly. The running time can be orders of magnitude lower if we target hyperclique patterns with high affinity. Another benefit is that the proposed algorithm can even identify patterns at very low support levels when the clique pruning ratio is sufficiently high.

5 Conclusions

In this paper, we addressed the problem of mining quantitative maximal hyperclique patterns in data sets with continuous attributes. Instead of mapping continuous attributes into binary attributes, we applied a data normalization method. Also, we provided algorithms for finding quantitative maximal hyperclique patterns. These algorithms are built on top of three state-of-the-art association pattern mining algorithms and include a clique pruning method to perform pruning for individual items. Finally, the performance of the algorithms has been demonstrated using real-world data sets.

Acknowledgement. This paper was partially supported by NSF grant #ACI-0305567 and NSF grant #CCF-054796.

References

1. Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93, May 1993.
2. U. Alon, N. Barkai, D.A. Notterman, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci., 96:6745-6750, June 1999.
3. D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In ICDE, 2001.
4. Eui-Hong Han, George Karypis, and Vipin Kumar. TR# 97-068: Min-Apriori: An algorithm for finding association rules in data with continuous attributes. Technical report, Department of Computer Science, University of Minnesota, 1997.
5. Y. Huang, H. Xiong, W. Wu, and Z. Zhang. A hybrid approach for mining maximal hyperclique patterns. In ICTAI, 2004.
6. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD, 2000.
7. D.T. Ross, U. Scherf, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24(3):227-235, 2000.
8. Ramakrishnan Srikant and Rakesh Agrawal. Mining quantitative association rules in large relational tables. In SIGMOD '96, 1996.
9. Michael Steinbach, Pang-Ning Tan, Hui Xiong, and Vipin Kumar. Extending the notion of support. In ACM SIGKDD, 2004.
10. H. Xiong, M. Steinbach, P. Tan, and V. Kumar. HICAP: Hierarchical clustering with pattern preservation. In Proc. of SIAM Int'l Conf. on Data Mining, 2004.
11. H. Xiong, P. Tan, and V. Kumar. Mining strong affinity association patterns in data sets with skewed support distribution. In ICDM 2003, 2003.
12. Mohammed Zaki and Karam Gouda. Fast vertical mining using diffsets. In ACM SIGKDD, 2003.