Efficient GSP Implementation based on XML Databases

2012 International Conference on Information and Knowledge Management (ICIKM 2012)
IPCSIT vol. 45 (2012) (2012) IACSIT Press, Singapore

Porjet Sansai and Juggapong Natwichai
Department of Computer Engineering, Faculty of Engineering, Chiang Mai University

Abstract. Sequential pattern mining is one of the most important data mining tasks. It can discover important patterns in transaction data, subject to the minimum support and the time constraints. In this paper, we address the problem of sequential pattern mining on XML databases. More specifically, the GSP algorithm, which is one of the most important sequential pattern mining algorithms, is realized in the XQuery language. The efficiency issue of the implementation is addressed by an XML structure designed to suit the algorithm. From the experiment results, our proposed approach is much more efficient than the traditional implementation.

Keywords: Sequential Pattern, XML databases, Implementation

1. Introduction

Information is one of the important keys to success in business. To obtain information from gathered data, data mining, a type of data processing that discovers useful information from large databases, has been widely utilized. The information can subsequently be used to improve the business, e.g. by discovering customers' purchasing patterns or by determining the appropriate campaign for the customers.

In [1], Srikant and Agrawal proposed the GSP algorithm to discover sequential patterns based on the transaction data model. In their model, a database is a collection of transactions. Each transaction is a set of items from I = {i_1, i_2, ..., i_m}, and is associated with the transaction owner ID and the transaction time. A sequence to be discovered is an ordered list of itemsets.
First, GSP takes each item i in I to form a 1-length candidate sequence <{i}>. It then scans the database to count the support of all 1-length candidates and obtain the frequent 1-length sequences; a sequence is frequent if it satisfies the minimum support. Next, GSP generates the set of 2-length candidate sequences from the frequent 1-length sequences, and scans the database again to extract all frequent 2-length sequences. The algorithm makes multiple passes over the data until no frequent k-length sequence is discovered. In the experiments reported in [1], GSP is much faster than the AprioriAll algorithm. It can also discover all sequential patterns with maximum and/or minimum time gaps and sliding windows.

The GSP algorithm can be considered a form of temporal data processing, since its features are suitable for implementation on temporal databases. For example, the sliding window, which is one of the GSP-required features, can be implemented on temporal databases directly. With the sliding window, a data-sequence contributes to the support of a sequence by allowing a set of transactions to contain an element of the sequence, as long as the difference in transaction-time between the transactions in the set is less than the user-specified window-size.

An example of pattern mining on temporal databases is given in [2], in which the authors proposed a novel approach to efficiently extract emerging patterns in temporal data using a sliding window coupled with a dual support mechanism. An advantage of the approach is that it consumes less memory. It also limits I/O and requires fewer computations by reusing the previously computed frequent sets, which avoids recalculating support counts that already exist in the overlap between windows.
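The sliding-window condition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function name `element_supported`, the `(time, items)` tuple layout, and the sample data are hypothetical, and the inclusive comparison against the window size is an assumption.

```python
from typing import Iterable

# Hypothetical layout: one transaction is (transaction time, set of items).
Transaction = tuple[int, frozenset[str]]


def element_supported(element: Iterable[str],
                      transactions: list[Transaction],
                      window_size: int) -> bool:
    """Return True if some group of transactions whose times differ by at
    most window_size (assumed inclusive) jointly contains the element."""
    needed = frozenset(element)
    ordered = sorted(transactions, key=lambda t: t[0])
    for i, (start_time, _) in enumerate(ordered):
        covered: set[str] = set()
        # Collect the items of every transaction inside the window that
        # opens at start_time.
        for time, items in ordered[i:]:
            if time - start_time > window_size:
                break
            covered |= items
            if needed <= covered:
                return True
    return False


# Hypothetical transactions of a single owner.
owner_transactions = [
    (6, frozenset({"1", "5"})),
    (7, frozenset({"6", "35"})),
]
```

With a window size of 1, the element {5, 6} is supported by the two transactions together; with a window size of 0, each element must be found inside a single transaction.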

Temporal databases were initially introduced in [3] as a type of database that supports valid time, transaction time, or both. In [4], the authors showed a mechanism for querying a temporal database. The temporal relational algebra (TRA) was also introduced. New temporal operators such as Until and Since, and Coalescing functions, were defined and revised to manipulate temporal data so that correct information can be extracted from a temporal database.

In [5], the authors focused on directly supporting efficient temporal coalescing queries within existing commercial RDBMSs, without any extension to current systems. They proposed two approaches. The first approach takes advantage of SQL:2003's OLAP support. The second approach applies user-defined aggregation. Both approaches need only a single scan of the database during query execution and use minimal joins, which makes efficient temporal coalescing possible. The study demonstrates that current commercial RDBMSs are mature enough to support complex temporal queries efficiently, and it also provides a direction for temporal database support within existing commercial engines.

For example, a business owner might be interested in the history of product orders bought between 1/1/2012 and 15/1/2012. The SQL statement of such a query can be written as follows:

    SELECT CUSID, BUYDATE, ITEM
    FROM ORDER, ORDER_DETAIL
    WHERE ORDER.TID = ORDER_DETAIL.TID
      AND (BUYDATE BETWEEN '1/1/2012' AND '15/1/2012')

In [5], the authors stated that more complex temporal queries, e.g. coalescing queries, can be written in SQL. However, the SQL statements can become very complex and often contain multiple nested "NOT EXISTS" clauses. For example, the SQL statement for the product order history of the customer with id 2 between 1/1/2012 and 15/1/2012 is as follows:

    WITH Temp(ITEMS_BOUGHT, T_START, T_END) AS
      (SELECT ITEMS, TSTART, TEND
       FROM ITEMS_BOUGHT
       WHERE CUSTOMER = 2)
    SELECT DISTINCT F.ITEMS_BOUGHT, F.T_START, L.T_END
    FROM Temp AS F, Temp AS L
    WHERE F.T_START < L.T_END
      AND F.ITEMS_BOUGHT = L.ITEMS_BOUGHT
      AND NOT EXISTS
        (SELECT * FROM Temp AS M
         WHERE M.ITEMS_BOUGHT = F.ITEMS_BOUGHT
           AND F.T_START < M.T_START
           AND M.T_START < L.T_END
           AND NOT EXISTS
             (SELECT * FROM Temp AS T1
              WHERE T1.ITEMS_BOUGHT = F.ITEMS_BOUGHT
                AND T1.T_START < M.T_START
                AND M.T_START <= T1.T_END))
      AND NOT EXISTS
        (SELECT * FROM Temp AS T2
         WHERE T2.ITEMS_BOUGHT = F.ITEMS_BOUGHT
           AND ((T2.T_START < F.T_START AND F.T_START <= T2.T_END)
             OR (T2.T_START < L.T_END AND L.T_END < T2.T_END)))

In [6], the authors showed that temporal data can be stored and queried efficiently using XML to provide temporally-grouped representations of the database history. In addition, SQL/XML can be used to implement temporal queries expressed in XQuery. The above query can be written as an XQuery statement as follows:

    xquery
    fn:distinct-values(
      for $info in db2-fn:xmlcolumn("BOUGHT.ITEMS")/buyinfo/cid[@id = 2]
      for $info2 in db2-fn:xmlcolumn("BOUGHT.ITEMS")/buyinfo/cid/item[. = $info/item]
                      [@T_START < $info/item/@T_END and @T_END <= $info/item/@T_END]
      return
        <item T_START="{$info2/@T_START}" T_END="{$info2/@T_END}">
          { $info2/text() }
        </item>)
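To make the intent of the coalescing query above easier to follow, the operation can be sketched procedurally in Python. This is a hedged illustration only, not the method of [5] or [6]: the function name `coalesce`, the `(value, t_start, t_end)` row layout, and the sample rows are all hypothetical. Periods that carry the same value and that overlap or meet are merged into one maximal period.

```python
from collections import defaultdict


def coalesce(rows):
    """Merge periods of the same value that overlap or meet.

    rows: iterable of (value, t_start, t_end) with t_start <= t_end.
    Returns a sorted list of (value, t_start, t_end) maximal periods.
    """
    by_value = defaultdict(list)
    for value, t_start, t_end in rows:
        by_value[value].append((t_start, t_end))

    result = []
    for value, periods in by_value.items():
        periods.sort()
        cur_start, cur_end = periods[0]
        for t_start, t_end in periods[1:]:
            if t_start <= cur_end:
                # The next period starts inside (or at the end of) the
                # current one, so extend the current period.
                cur_end = max(cur_end, t_end)
            else:
                result.append((value, cur_start, cur_end))
                cur_start, cur_end = t_start, t_end
        result.append((value, cur_start, cur_end))
    return sorted(result)


# Hypothetical purchase history of one customer.
history = [("pen", 1, 5), ("pen", 4, 9), ("pen", 12, 15), ("ink", 2, 3)]
merged = coalesce(history)
```

A single sort and one linear pass per value replace the nested existence checks, which is one way to see why the declarative form becomes hard to read.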

It can be seen that the XQuery language is more expressive than the SQL language for this purpose, since XQuery can process semi-structured data efficiently. It is also more comprehensible than the SQL statements. Furthermore, the same work demonstrates that temporal data and temporal queries can be supported efficiently by the proposed features, e.g. data clustering and indexing techniques. Subsequently, the approaches are realized in ArchIS (Archival Information System), in which transaction-time capability can be added to existing RDBMSs and applications efficiently.

In this paper, we propose an implementation of the GSP algorithm on XML databases. The XML structure of the data suited to the algorithm is considered, since such a structure can affect the efficiency of the algorithm. The GSP algorithm itself is implemented in the XQuery language. The proposed work is evaluated by experiments, in which it is compared with the algorithm implemented on a traditional RDBMS.

2. Approaches for GSP Implementation

2.1. Transaction Data in XML Structure

First, the XML structure of the data is considered. As mentioned in Section 1, each transaction contains the transaction owner, the transaction time, and the itemsets. We propose to store the transactions in the XML structure shown in Fig. 1a). It can be seen that the itemsets of the same transaction owner are stored in a single tag (own), and the transaction owner identifier is presented as an attribute (id). Within an owner tag, the transaction time of each item is identified by the time attribute. For example, it can be seen that the transaction owner with id 1 has items 1 and 5 at time stamp 6. This approach is based on the fact that XML parsers, in general, have to split the opening and closing tags when the enclosed data are to be read.
Thus, the structure can be efficient, since the transaction owner identifier, which might be scanned multiple times, can be read without re-traversing the tree structure of the XML data [7]. This is also the case for the transaction time, which is presented as an attribute.

    <transactions>
      <own id="1">
        <item time="6">1</item>
        <item time="6">5</item>
        <item time="7">6</item>
        <item time="7">35</item>
      </own>
      ...
      <own id="1">
        ...
      </own>
    </transactions>

    a) XML structure

    Let C_k denote the set of k-length candidate sequences.
    Let Owner denote all transaction owners.
    for each (c in C_k) {
      for each owner in Owner
        Parse the elements of c by the white spaces.
        XQuery1: Determine the first time stamp of supporting transactions.
        for each time stamp from XQuery1
          for each element
            XQuery2: Determine the element with minimum time stamp.
            if the element satisfies the time constraints
              Increase the support count for the candidate.
    }

    b) GSP k-length candidate verification

Fig. 1: XML Structure and Algorithm.

2.2. GSP Algorithm Implementation

The implementation of the algorithm is almost straightforward, except for the parts that need to query the XML data. We present only the k-length candidate verification part, since the rest of the algorithm is exactly the same as the GSP algorithm. The algorithm is presented in Fig. 1b). It can be seen that the algorithm is composed of the candidate verification loop (C_k), the transaction owner loop (Owner), and the time constraints loop. In the transaction owner loop, the time stamp at which the owner starts to support the candidate needs to be determined from the XML structure. This is done by XQuery 1, as shown in Fig. 2a).
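The layout of Fig. 1a) and the first lookup of the verification loop, i.e. finding the time stamps at which one owner holds a given item, can be imitated outside the database with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch for illustration, not the DB2 XQuery used in the paper; the function name `supporting_times` and the second owner in the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document in the layout of Fig. 1a); owner 2 is made-up data.
SAMPLE = """
<transactions>
  <own id="1">
    <item time="6">1</item>
    <item time="6">5</item>
    <item time="7">6</item>
    <item time="7">35</item>
  </own>
  <own id="2">
    <item time="3">5</item>
    <item time="9">5</item>
  </own>
</transactions>
"""


def supporting_times(root, owner_id, item):
    """Return the distinct time stamps, in ascending order, at which the
    owner with the given id has the given item."""
    times = {
        int(node.get("time"))
        for node in root.findall(f"./own[@id='{owner_id}']/item")
        if node.text == item
    }
    return sorted(times)


root = ET.fromstring(SAMPLE.strip())
```

Because the owner id and the time stamp are attributes, a single path expression selects the relevant items of one owner without visiting the items of the other owners.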

The minimum time stamp of the supporting transactions within the time constraints is determined by XQuery 2, as shown in Fig. 2b). In XQuery 2, own, patternarr[p], window_time, and max_gap are variables of the host program that are concatenated into the query string.

    xquery fn:distinct-values(
      for $info in db2-fn:xmlcolumn("dbt2v6.infogsp")/transactions/own[@id="1"]
      return $info/item[.="2"]/@time)

    a) XQuery 1

    "xquery fn:min(
       for $info in db2-fn:xmlcolumn(\"dbt2v6.infogsp\")/transactions/own[@id = \"" + own + "\"]
       return $info/item[.=\"" + patternarr[p] + "\"
                         and @time >= " + window_time + "
                         and @time <= " + (window_time + max_gap) + "]/@time)"

    b) XQuery 2

Fig. 2: XQueries.

3. Experiment Results

In this section, the efficiency of the proposed GSP implementation on the XML database is evaluated. The experiment was conducted on an Intel Core i5 2.4 GHz PC running Windows 7, with 4 GB of memory and a 5 GB SATA hard drive. The GSP algorithm was implemented on two different database structures: XML data and relational data. The IBM Quest Market-Basket Synthetic Data Generator was used to generate datasets with 2 transactions for our experiments. IBM DB2 Express-C 9.7.5 was selected as the database engine since it supports both traditional relational data and XML. The datasets were stored in the two mentioned database structures. In each experiment, the execution time over multiple runs is reported as the measure of efficiency.

[Fig. 3: Execution time without temporal constraints. x-axis: minimum support; y-axis: execution time.]

Fig. 3 shows the execution time of sequential pattern discovery without any temporal constraints. On the x-axis, the minimum support is varied from .5 to .9. On the y-axis, the execution time is reported. From the result, the execution time of the GSP algorithm implemented on the XML database is much lower, approximately 3% less than that of the RDBMS implementation. Also, the difference is larger when the minimum support decreases. The reason is that the RDBMS has to scan every transaction to determine whether an item exists in the transaction, while in the XML database such unnecessary repeated scans can be avoided by the proposed approach.
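The role that XQuery 2 plays inside the candidate verification loop can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `min_time_in_range` and `count_support`, the in-memory `DATABASE` dictionary, integer time stamps, and the restriction to candidates of two single-item elements are all hypothetical simplifications.

```python
def min_time_in_range(items, wanted, window_time, max_gap):
    """Mirror of XQuery 2: smallest time stamp of `wanted` in the range
    [window_time, window_time + max_gap], or None if there is none."""
    times = [t for t, item in items
             if item == wanted and window_time <= t <= window_time + max_gap]
    return min(times) if times else None


def count_support(database, first, second, max_gap):
    """Count the owners who have `first` and, at a strictly later time
    stamp that is at most max_gap away, `second`."""
    support = 0
    for items in database.values():
        # Counterpart of XQuery 1: time stamps at which the owner has `first`.
        start_times = sorted({t for t, item in items if item == first})
        for t in start_times:
            # Search the range [t + 1, t + max_gap]; this relies on the
            # assumption that time stamps are integers.
            if min_time_in_range(items, second, t + 1, max_gap - 1) is not None:
                support += 1
                break  # each owner is counted at most once
    return support


# Hypothetical data: owner id -> list of (time stamp, item).
DATABASE = {
    "1": [(6, "1"), (6, "5"), (7, "6"), (7, "35")],
    "2": [(3, "5"), (9, "6")],
}
```

For the candidate <{5}{6}>, a maximum gap of 2 is satisfied only by owner 1, while a maximum gap of 6 is satisfied by both owners. The extra range lookup per starting time stamp is the additional work that the time constraints introduce.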
Subsequently, the proposed work was evaluated with the GSP features, i.e. the sliding window and the maximum time gap. In Fig. 4a), the sliding window is investigated; it is fixed at 5 time-points. In Fig. 4b), the execution time when the maximum gap is fixed at 5 time-points is investigated. From Fig. 4, it can be seen that when the time constraints are applied, the proposed approach is less efficient. This is because the time constraints force the algorithm to determine the minimum time stamp of the supporting transactions in XQuery 2, which requires more computation. However, when the support is increased, the difference between the two approaches is larger, since XQuery 2 can eliminate the candidates which satisfy the time constraints at less cost. In summary, it is clear that the implementation on the XML structure is much more efficient than the RDBMS implementation.

[Fig. 4: Execution time with time constraints feature. a) Sliding window effect; b) Maximum gap effect. x-axis: minimum support; y-axis: execution time.]

4. Conclusion

In this paper, we have proposed an approach to implement the GSP algorithm on XML databases. The XML structure, which can provide high efficiency, has been proposed together with the corresponding algorithm based on the XQuery language. The proposed approach has been validated by experiments. The results have shown that the XML-based implementation is much more efficient than the RDBMS approach. It has also been found that the time constraints degrade the efficiency of the proposed approach. However, the approach can still outperform the traditional approach.

5. References

[1] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvement. In Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, 1996, Springer-Verlag, London, UK, pp. 3-17.
[2] M. Sulaiman Khan, F. Coenen, D. Reid, R. Patel and L. Archer. A Sliding Windows based Dual Support Framework for Discovering Emerging Trends from Temporal Data. Knowledge-Based Systems, 2010, 23(4), pp. 316-322.
[3] C. Jensen. Temporal Database Management, Ph.D. Thesis, 2000. Department of Computer Science, Aalborg University.
[4] J. Patel. Temporal Database System, Master Thesis, 2003. Department of Computing, Imperial College, University of London.
[5] X. Zhou, F. Wang, and C. Zaniolo. Efficient Temporal Coalescing Query Support in Relational Database Systems. In Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06), 2006, pp. 676-686.
[6] F. Wang, C. Zaniolo, and X. Zhou. ArchIS: an XML-based approach to transaction-time temporal database systems. The VLDB Journal, 2008, 17(6), pp. 1445-1463.
[7] F. Özcan, D. Chamberlin, K. Kulkarni, and J.-E. Michels. Integration of SQL and XQuery in IBM DB2. IBM Systems Journal, 2006, 45(2), pp. 245-270.