Maintaining Frequent Itemsets over High-Speed Data Streams

Similar documents
Maintaining Frequent Closed Itemsets over a Sliding Window

Maintaining frequent closed itemsets over a sliding window

Incremental updates of closed frequent itemsets over continuous data streams

FREQUENT ITEMSET MINING IN TRANSACTIONAL DATA STREAMS BASED ON QUALITY CONTROL AND RESOURCE ADAPTATION

Mining Recent Frequent Itemsets in Data Streams with Optimistic Pruning

Mining Vague Association Rules

Mining Maximum frequent item sets over data streams using Transaction Sliding Window Techniques

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

Data Analysis in the Internet of Things (Nesnelerin İnternetinde Veri Analizi)

Mining Frequent Itemsets in Time-Varying Data Streams

Mining Frequent Patterns with Screening of Null Transactions Using Different Models

Association Rule Mining: FP-Growth

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window

Temporal Weighted Association Rule Mining for Classification

Maintenance of the Prelarge Trees for Record Deletion

An Algorithm for Mining Frequent Itemsets from Library Big Data

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining

Interactive Mining of Frequent Itemsets over Arbitrary Time Intervals in a Data Stream

An Efficient Sliding Window Based Algorithm for Adaptive Frequent Itemset Mining over Data Streams

Random Sampling over Data Streams for Sequential Pattern Mining

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING

Online Mining of Frequent Query Trees over XML Data Streams

Comparing the Performance of Frequent Itemsets Mining Algorithms

Leveraging Set Relations in Exact Set Similarity Join

Generation of Potential High Utility Itemsets from Transactional Databases

Mining top-k frequent itemsets from data streams

Mining Top-K High Utility Itemsets - Report

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

An Improved Apriori Algorithm for Association Rules

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Upper bound tighter Item caps for fast frequent itemsets mining for uncertain data Implemented using splay trees. Shashikiran V, Murali S

Comparative Study of Subspace Clustering Algorithms

Data Mining Part 3. Association Rules

In-stream Frequent Itemset Mining with Output Proportional Memory Footprint

Mining Frequent Patterns from Data streams using Dynamic DP-tree

Frequent Pattern Mining in Data Streams. Raymond Martin

A Robust Wipe Detection Algorithm

Frequent Patterns mining in time-sensitive Data Stream

A New Method for Mining High Average Utility Itemsets

Closed Pattern Mining from n-ary Relations

CS570 Introduction to Data Mining

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set

On Multiple Query Optimization in Data Mining

Multiterm Keyword Searching For Key Value Based NoSQL System

Frequent Itemsets Melange

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

An Efficient Algorithm for finding high utility itemsets from online sell

Data Mining Query Scheduling for Apriori Common Counting

An Algorithm for Interesting Negated Itemsets for Negative Association Rules from XML Stream Data

Mining N-most Interesting Itemsets. Ada Wai-chee Fu, Renfrew Wang-wai Kwong, Jian Tang

Implementation and Experiments of Frequent GPS Trajectory Pattern Mining Algorithms

Memory issues in frequent itemset mining

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views

A Survey on Disk-based Genome Sequence Indexing

Mining Imperfectly Sporadic Rules with Two Thresholds

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

A SHORTEST ANTECEDENT SET ALGORITHM FOR MINING ASSOCIATION RULES*

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

UDS-FIM: An Efficient Algorithm of Frequent Itemsets Mining over Uncertain Transaction Data Streams

CSCI6405 Project - Association rules mining

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Enhancing K-means Clustering Algorithm with Improved Initial Center

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Mining High Average-Utility Itemsets

Efficient High Utility Itemset Mining using extended UP Growth on Educational Feedback Dataset

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern, KTI, TU Graz

ETP-Mine: An Efficient Method for Mining Transitional Patterns

Appropriate Item Partition for Improving the Mining Performance

Data Access Paths for Frequent Itemsets Discovery

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

CHUIs-Concise and Lossless representation of High Utility Itemsets

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

N-Model Tests for VLSI Circuits

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

A Novel Method of Optimizing Website Structure

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Parallelizing Frequent Itemset Mining with FP-Trees

A Comparative study of CARM and BBT Algorithm for Generation of Association Rules

CSE 5243 INTRO. TO DATA MINING

Mining Top-K Path Traversal Patterns over Streaming Web Click-Sequences *

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Adaptive Load Shedding for Mining Frequent Patterns from Data Streams

Information Sciences

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

An Approximate Scheme to Mine Frequent Patterns over Data Streams

Comparison of Different Navigation Prediction Techniques

Monotone Constraints in Frequent Tree Mining

Data Structure for Association Rule Mining: T-Trees and P-Trees

ANALYSIS AND EVALUATION OF DISTRIBUTED DENIAL OF SERVICE ATTACKS IDENTIFICATION METHODS

Stream Sequential Pattern Mining with Precise Error Bounds

Transcription:

Maintaining Frequent Itemsets over High-Speed Data Streams

James Cheng, Yiping Ke, and Wilfred Ng
Department of Computer Science, Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong, China
{csjames, keyiping, wilfred}@cs.ust.hk

Abstract. We propose a false-negative approach to approximate the set of frequent itemsets (FIs) over a sliding window. Existing approximate algorithms use an error parameter, ε, to control the accuracy of the mining result. However, the use of ε leads to a dilemma: a smaller ε gives a more accurate mining result but incurs higher computational cost, while a larger ε degrades the mining accuracy. We address this dilemma by introducing a progressively increasing minimum support function. The longer an itemset is retained in the window, the closer we require its minimum support to approach that of an FI. Thus, the number of potential FIs to be maintained is greatly reduced. Our experiments show that our algorithm not only attains highly accurate mining results, but also runs significantly faster and consumes less memory than existing algorithms for mining FIs over a sliding window.

1 Introduction

Frequent itemset (FI) mining is fundamental to many important data mining tasks. Recently, the increasing prominence of data streams has led to the study of online mining of FIs [5]. Due to the constraints on both memory consumption and processing efficiency of stream processing, together with the exploratory nature of FI mining, research studies have sought to approximate FIs over streams.

Existing approximation techniques for mining FIs are mainly false-positive [5, 4, 1, 2]. These approaches use an error parameter, ε, to control the quality of the approximation. However, the use of ε leads to a dilemma. A smaller ε gives a more accurate mining result. Unfortunately, a smaller ε also results in an enormously larger number of itemsets to be maintained, thereby drastically increasing the memory consumption and lowering the processing efficiency. A false-negative approach [6] has recently been proposed to address this dilemma. However, that method focuses on the entire history of a stream and does not distinguish recent itemsets from old ones.

We propose a false-negative approach to mine FIs over high-speed data streams. Our method places greater importance on recent data by adopting a sliding window model. To tackle the problem introduced by the use of ε, we consider ε as a relaxed minimum support threshold and propose to progressively increase the value of ε for an itemset as it is kept longer in the window. In this way, the number of itemsets to be maintained is greatly reduced, thereby saving both memory and processing power. We design a progressively increasing minimum support function and devise an algorithm to mine FIs over a sliding window. Our experiments show that our approach obtains highly accurate mining results even with a large ε, so that the mining efficiency is significantly improved. In most cases, our algorithm runs significantly faster and consumes less memory than the state-of-the-art algorithms [5, 2], while attaining the same level of accuracy.

This work is partially supported by RGC CERG under grant numbers HKUST6185/02E and HKUST6185/03E.

2 Preliminaries

Let I = {x_1, x_2, ..., x_m} be a set of items. An itemset is a subset of I. A transaction, X, is an itemset, and X supports an itemset, Y, if Y ⊆ X. A transaction data stream is a continuous sequence of transactions. We denote a time unit in the stream as t_i, within which a variable number of transactions may arrive. A window or a time interval in the stream is a set of successive time units, denoted as T = ⟨t_i, ..., t_j⟩, where i ≤ j, or simply T = t_i if i = j. A sliding window in the stream is a window that slides forward for every time unit. The window at each slide has a fixed number, w, of time units, and w is called the size of the window. In this paper, we use t_τ to denote the current time unit; thus, the current window is W = ⟨t_{τ−w+1}, ..., t_τ⟩. We define trans(T) as the set of transactions that arrive on the stream in a time interval T, and |trans(T)| as the number of transactions in trans(T). The support of an itemset X over T, denoted sup(X, T), is the number of transactions in trans(T) that support X. Given a predefined Minimum Support Threshold (MST), σ (0 < σ ≤ 1), we say that X is a frequent itemset (FI) over T if sup(X, T) ≥ σ|trans(T)|.

Given a transaction data stream and an MST σ, the problem of FI mining over a sliding window is to find the set of all FIs over the window at each slide.

3 A Progressively Increasing MST Function

Existing approaches [5, 4, 2] use an error parameter, ε, to control the mining accuracy, which leads to a dilemma. We tackle this problem by considering ε = rσ as a relaxed MST, where r (0 < r ≤ 1) is the relaxation rate, to mine the set of FIs over each time unit t in the sliding window. Since all itemsets whose support is less than rσ|trans(t)| are discarded, we define the computed support as follows.

Definition 1 (Computed Support) The computed support of an itemset X over a time unit t is defined as

$$\widetilde{sup}(X, t) = \begin{cases} 0 & \text{if } sup(X, t) < r\sigma\,|trans(t)|, \\ sup(X, t) & \text{otherwise.} \end{cases}$$

The computed support of X over a time interval T = ⟨t_j, ..., t_l⟩ is defined as

$$\widetilde{sup}(X, T) = \sum_{i=j}^{l} \widetilde{sup}(X, t_i).$$
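To make Definition 1 concrete, the following is a minimal Python sketch, assuming transactions are represented as frozensets of items and a time unit as a list of such transactions; the function names and data layout are illustrative assumptions, not part of the original pseudocode.

```python
# A minimal sketch of Definition 1. Transactions are frozensets of items and
# a time unit is a list of transactions; this layout is an assumption.

def support(itemset, transactions):
    """sup(X, t): number of transactions in t that contain every item of X."""
    X = frozenset(itemset)
    return sum(1 for trans in transactions if X <= trans)

def computed_support(itemset, time_unit, r, sigma):
    """Computed support over one time unit: the true support, or 0 when it
    falls below the relaxed threshold r * sigma * |trans(t)|."""
    s = support(itemset, time_unit)
    return s if s >= r * sigma * len(time_unit) else 0

def computed_support_over(itemset, time_units, r, sigma):
    """Computed support over an interval T = <t_j, ..., t_l>: the sum of
    the per-unit computed supports."""
    return sum(computed_support(itemset, t, r, sigma) for t in time_units)
```

For example, with σ = 0.01 and r = 0.1, an itemset must appear in at least 0.1% of a unit's transactions for that unit to contribute to its computed support.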

Based on the computed support of an itemset, we apply a progressively increasing MST function to define a semi-frequent itemset.

Definition 2 (Semi-Frequent Itemset) Let W = ⟨t_{τ−w+1}, ..., t_τ⟩ be a window of size w and T_k = ⟨t_{τ−k+1}, ..., t_τ⟩, where 1 ≤ k ≤ w, be the most recent k time units in W. We define a progressively increasing function

$$minsup(k) = m_k \cdot r_k, \quad \text{where } m_k = \sigma\,|trans(T_k)| \text{ and } r_k = \Big(\frac{1-r}{w}\Big)(k-1) + r.$$

An itemset X is a semi-frequent itemset (semi-FI) over W if $\widetilde{sup}(X, T_k) \ge minsup(k)$, where k = τ − o + 1 and t_o is the oldest time unit such that $\widetilde{sup}(X, t_o) > 0$.

The first term, m_k, in the minsup function in Definition 2 is the minimum support required for an FI over T_k, while the second term, r_k, progressively increases the relaxed MST rσ at the rate of (1 − r)/w for each older time unit in the window. We keep X in the window only if its computed support over T_k is no less than minsup(k), where T_k is the time interval starting from the time unit t_o, in which the support of X begins to be computed, up to the current time unit t_τ.

4 Mining FIs over a Sliding Window

We use a prefix tree to keep the semi-FIs. A node in the prefix tree represents an itemset, X, and has three fields: (1) item, which is the last item of X; (2) uid(X), which is the ID of the time unit, t_{uid(X)}, in which X is inserted into the prefix tree; and (3) sup(X), which is the computed support of X since t_{uid(X)}. The algorithm for mining FIs over a sliding window, MineSW, is given as Algorithm 1; a hedged code sketch of its main components follows the algorithm.

Algorithm 1 (MineSW)
Input: (1) An empty prefix tree. (2) σ, r and w. (3) A transaction data stream.
Output: An approximate set of FIs of the window at each slide.

1. Mine all FIs over each time unit using a relaxed MST rσ.
2. Initialization: For each of the first w time units, t_i (1 ≤ i ≤ w), mine all FIs from trans(t_i). For each mined itemset, X, check if X is in the prefix tree.
   (a) If X is in the prefix tree, perform the following operations: (i) add sup(X, t_i) to sup(X); (ii) if sup(X) < minsup(i − uid(X) + 1), remove X from the prefix tree and stop mining the supersets of X from trans(t_i).
   (b) If X is not in the prefix tree, create a new node for X in the prefix tree with uid(X) = i and sup(X) = sup(X, t_i).
3. Incremental Update: For each expiring time unit, t_{τ−w+1}, mine all FIs from trans(t_{τ−w+1}). For each mined itemset, X: if X is in the prefix tree and τ − uid(X) + 1 ≥ w, subtract sup(X, t_{τ−w+1}) from sup(X); otherwise, stop mining the supersets of X from trans(t_{τ−w+1}). If sup(X) becomes 0, remove X from the prefix tree; otherwise, set uid(X) = τ − w + 2.
   For each incoming time unit, t_τ, mine all FIs from trans(t_τ). For each mined itemset, X, check if X is in the prefix tree.
   (a) If X is in the prefix tree, perform the following operations: (i) add sup(X, t_τ) to sup(X); (ii) if either τ − uid(X) + 1 ≤ w and sup(X) < minsup(τ − uid(X) + 1), or τ − uid(X) + 1 > w and sup(X) < minsup(w), remove X from the prefix tree and stop mining the supersets of X from trans(t_τ).
   (b) If X is not in the prefix tree, create a new node for X in the prefix tree with uid(X) = τ and sup(X) = sup(X, t_τ).
4. Pruning and Outputting: Scan the prefix tree once. For each itemset X visited: remove X and its descendants from the prefix tree if (1) τ − uid(X) + 1 ≤ w and sup(X) < minsup(τ − uid(X) + 1), or (2) τ − uid(X) + 1 > w and sup(X) < minsup(w). Output X if sup(X) ≥ σ|trans(W)| (we can thus set minsup(w) = σ|trans(W)| to prune more itemsets).
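The sketch below, in Python, shows the prefix-tree node fields of Section 4, the minsup function of Definition 2, and the pruning test of step 4. The data layout (children kept in a dict, per-unit transaction counts kept in a list) and all names are assumptions for illustration; only the formulas come from the definitions above.

```python
# A hedged sketch of Definition 2 and the bookkeeping of Algorithm 1.

class Node:
    """Prefix-tree node representing an itemset X (Section 4's three fields)."""
    def __init__(self, item, uid, sup):
        self.item = item      # last item of X
        self.uid = uid        # ID of the time unit in which X was inserted
        self.sup = sup        # computed support of X since t_uid
        self.children = {}    # child nodes keyed by their last item (assumed)

def minsup(k, trans_counts, sigma, r, w):
    """minsup(k) = m_k * r_k over the most recent k time units T_k.

    trans_counts holds |trans(t)| for each time unit of the current
    window, oldest first, so trans_counts[-k:] covers T_k."""
    m_k = sigma * sum(trans_counts[-k:])   # support an FI would need over T_k
    r_k = ((1 - r) / w) * (k - 1) + r      # relaxation rate, grows with age
    return m_k * r_k

def survives(node, tau, trans_counts, sigma, r, w):
    """Step 4's pruning test: keep X only while its computed support meets
    the progressively increasing threshold for its age in the window."""
    k = tau - node.uid + 1
    return node.sup >= minsup(min(k, w), trans_counts, sigma, r, w)
```

With r = 0.1 and w = 10, for instance, r_1 = 0.1 while r_10 = 0.91: a newly inserted itemset only needs the relaxed support rσ|trans(t)|, whereas an itemset spanning the whole window must be almost frequent, which is why step 4 can tighten minsup(w) to σ|trans(W)|.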

5 Experimental Evaluation

We run our experiments on a Sun Ultra-SPARC III with a 900 MHz CPU and 4 GB RAM. We compare our algorithm, MineSW, with a variant of the Lossy Counting algorithm [5] applied in the sliding window model, which we refer to as the Lossy Counting variant. We remark that this variant, which updates a batch of incoming/expiring transactions at each window slide, is different from the algorithm proposed by Chang and Lee [2], which updates on each incoming/expiring transaction. We implement both algorithms and find that the algorithm by Chang and Lee is much slower than the Lossy Counting variant and runs out of our 4 GB memory. We generate two types of data streams, t10i4 and t15i6, using a generator [3] that modifies the IBM data generator.

We first find (see details in [3]) that when r increases from 0.1 to 1, the precision of the Lossy Counting variant (with ε = rσ) drops from 98% to around 10%, while the recall of MineSW only drops from 99% to around 90%. This result reveals that the estimation mechanism of the Lossy Counting algorithm relies on ε to control the mining accuracy, while our progressively increasing minsup function maintains a high accuracy that is only slightly affected by the change in r. Since increasing r means a faster mining process and less memory consumption, we can use a larger r to obtain highly accurate mining results at much faster speed and lower memory consumption.

We test r = 0.1 and r = 0.5 for MineSW. According to Lossy Counting [5], a good choice of ε is 0.1σ, and hence we set r = 0.1 for the Lossy Counting variant. Fig. 1 (a) and (b) show that for all σ, the precision of the Lossy Counting variant is over 94% and the recall of MineSW is over 96% (mostly over 99%). The recall of MineSW (r = 0.5) is only slightly lower than that of MineSW (r = 0.1). However, Fig. 2 (a) and (b) show that MineSW (r = 0.5) is significantly faster than both MineSW (r = 0.1) and the Lossy Counting variant, especially when σ is small. Fig. 3 (a) and (b) show the memory consumption of the algorithms in terms of the number of itemsets maintained at the end of each slide. The number of itemsets kept by MineSW (r = 0.1) is about 1.5 times less than that of the Lossy Counting variant, while that kept by MineSW (r = 0.5) is less than that of the Lossy Counting variant by up to several orders of magnitude.
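The precision and recall reported above compare each algorithm's output at a slide against the exact FIs of the window. As a brief illustration of how such numbers can be computed (the set representation and function name are illustrative assumptions; the exact sets would come from an offline miner):

```python
# Precision/recall of an approximate FI set against the exact FI set at one
# slide; both are sets of frozenset itemsets mined over the same window.

def precision_recall(exact, approx):
    true_pos = len(exact & approx)
    precision = true_pos / len(approx) if approx else 1.0
    recall = true_pos / len(exact) if exact else 1.0
    return precision, recall
```

A false-positive scheme never misses a true FI, so only its precision degrades; a false-negative scheme such as MineSW is the mirror image: it never outputs a non-FI, so its precision stays at 100% and recall is the quantity to watch.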

6 Conclusions

We propose a progressively increasing minimum support function, which allows us to increase ε at the expense of only slightly degraded accuracy, but significantly improves the mining efficiency and saves memory. We verify, by extensive experiments, that our algorithm is significantly faster and consumes less memory than existing algorithms, while attaining the same level of accuracy. When applications require highly accurate mining results, our experiments show that by setting ε = 0.1σ (a rule-of-thumb choice of ε in Lossy Counting [5]), our algorithm attains 100% precision and over 99.99% recall.

[Fig. 1: Precision and Recall with Varying Minimum Support Threshold. Panels: (a) Precision (%), (b) Recall (%); curves for MineSW (r = 0.5), MineSW (r = 0.1), and the Lossy Counting variant on t10i4 and t15i6.]

[Fig. 2: Processing Time with Varying Minimum Support Threshold. Panels: (a) t10i4, (b) t15i6; y-axis: Time (sec).]

[Fig. 3: Memory Consumption with Varying Minimum Support Threshold. Panels: (a) t10i4, (b) t15i6; y-axis: # of Itemsets (K).]

References

1. J. H. Chang and W. S. Lee. estWin: Adaptively Monitoring the Recent Change of Frequent Itemsets over Online Data Streams. In Proc. of CIKM, 2003.
2. J. H. Chang and W. S. Lee. A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams. Journal of Information Science and Engineering, Vol. 20, No. 4, July 2004.
3. J. Cheng, Y. Ke, and W. Ng. Maintaining Frequent Itemsets over High-Speed Data Streams. Technical Report, http://www.cs.ust.hk/~csjames/pakdd06tr.pdf.
4. H. Li, S. Lee, and M. Shan. An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams. In Proc. of the First International Workshop on Knowledge Discovery in Data Streams, 2004.
5. G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proc. of VLDB, 2002.
6. J. Yu, Z. Chong, H. Lu, and A. Zhou. False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. In Proc. of VLDB, 2004.