Mining of Web Server Logs using Extended Apriori Algorithm

Similar documents
Improved Frequent Pattern Mining Algorithm with Indexing

DATA MINING II - 1DL460

CS570 Introduction to Data Mining

The Transpose Technique to Reduce Number of Transactions of Apriori Algorithm

Performance Based Study of Association Rule Algorithms On Voter DB

Comparing the Performance of Frequent Itemsets Mining Algorithms

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Improved Apriori Algorithm for Association Rules

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW

Temporal Weighted Association Rule Mining for Classification

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining

Data Mining Part 3. Associations Rules

Optimization using Ant Colony Algorithm

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

DMSA TECHNIQUE FOR FINDING SIGNIFICANT PATTERNS IN LARGE DATABASE

Web Log Mining using Improved Version of Proposed Algorithm

Improved Apriori Algorithms- A Survey

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

A mining method for tracking changes in temporal association rules from an encoded database

Mining Frequent Patterns with Counting Inference at Multiple Levels

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

CSCI6405 Project - Association rules mining

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Performance Evaluation for Frequent Pattern mining Algorithm

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

An Algorithm for Frequent Pattern Mining Based On Apriori

Medical Data Mining Based on Association Rules

EFFICIENT ALGORITHM FOR MINING FREQUENT ITEMSETS USING CLUSTERING TECHNIQUES

An improved approach of FP-Growth tree for Frequent Itemset Mining using Partition Projection and Parallel Projection Techniques

An Improved Technique for Frequent Itemset Mining

Parallelizing Frequent Itemset Mining with FP-Trees

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study

Product presentations can be more intelligently planned

Mining High Average-Utility Itemsets

Hybrid Approach for Improving Efficiency of Apriori Algorithm on Frequent Itemset

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

A Comparative Study of Association Mining Algorithms for Market Basket Analysis

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

Association Rule Mining from XML Data

Parallel Algorithms for Discovery of Association Rules

Gurpreet Kaur 1, Naveen Aggarwal 2 1,2

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

PLT- Positional Lexicographic Tree: A New Structure for Mining Frequent Itemsets

Review paper on Mining Association rule and frequent patterns using Apriori Algorithm

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Study on Apriori Algorithm and its Application in Grocery Store

Sensitive Rule Hiding and InFrequent Filtration through Binary Search Method

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal

Association Rule Mining. Entscheidungsunterstützungssysteme

Analysis of Association rules for Big Data using Apriori and Frequent Pattern Growth Techniques

Parallel Popular Crime Pattern Mining in Multidimensional Databases

Improved Version of Apriori Algorithm Using Top Down Approach

Available online at ScienceDirect. Procedia Computer Science 37 (2014 )

Association mining rules

Association Rule Mining. Introduction 46. Study core 46

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE

FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Sanguthevar Rajasekaran

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: [35] [Rana, 3(12): December, 2014] ISSN:

Fast Algorithm for Mining Association Rules

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

Upper bound tighter Item caps for fast frequent itemsets mining for uncertain data Implemented using splay trees. Shashikiran V 1, Murali S 2

International Journal of Advance Engineering and Research Development. Fuzzy Frequent Pattern Mining by Compressing Large Databases

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

A COMBINTORIAL TREE BASED FREQUENT PATTERN MINING

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

AN IMPROVED APRIORI BASED ALGORITHM FOR ASSOCIATION RULE MINING

Data Structure for Association Rule Mining: T-Trees and P-Trees

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

IMPROVING APRIORI ALGORITHM USING PAFI AND TDFI

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Fahad Kamal Alsheref Information Systems Department, Modern Academy, Cairo Egypt

Appropriate Item Partition for Improving the Mining Performance

ASSOCIATION RULE MINING: MARKET BASKET ANALYSIS OF A GROCERY STORE

Association Rule Mining among web pages for Discovering Usage Patterns in Web Log Data L.Mohan 1

Keywords Apriori Growth, FP Split, SNS, frequent patterns.

An Algorithm for Mining Large Sequences in Databases

Mining Temporal Association Rules in Network Traffic Data

Finding the boundaries of attributes domains of quantitative association rules using abstraction- A Dynamic Approach

Parallel Mining Association Rules in Calculation Grids

A Modified Apriori Algorithm for Fast and Accurate Generation of Frequent Item Sets

A Literature Review of Modern Association Rule Mining Techniques

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

INTELLIGENT SUPERMARKET USING APRIORI

Transcription:

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational and Applied Sciences (IJETCAS) www.iasir.net ISSN (Print): 2279-0047 ISSN (Online): 2279-0055 Mining of Web Server Logs using Extended Apriori Algorithm SANJEEV DHAWAN 1, SWATI GOEL 2 1 Faculty of Computer Science & Engineering, University Institute of Engineering and Technology (U.I.E.T.), Kurukshetra University, Kurukshetra-136119, Haryana, INDIA 2 M.Tech. Research Scholar, University Institute of Engineering and Technology (U.I.E.T.), Kurukshetra University, Kurukshetra-136119, Haryana, INDIA Abstract: Association rule mining is one of the most significant techniques in the field of data mining. It is very useful in discovering relationships hidden in large transaction datasets such as frequent patterns, associations etc. One of the popular and important algorithms in this category is Apriori algorithm which finds frequent itemsets using an iterative approach. But it suffers from a major limitation that in case of large databases, it requires a large number of passes while searching the frequent itemsets, thus increasing its scanning time. In order to lessen this time, an improved version of Apriori algorithm, called Extended Apriori is proposed in this paper which decreases the number of transactions in the database, hence reducing size of the database so as to minimize the scanning time. This extended algorithm is then used to mine web server logs of an educational web site in order to discover frequently visited pages by the user and also its performance is compared graphically with existing Apriori algorithm. In the end, the paper also outlines some future research directions in the area of web server log mining. Keywords: Data mining, Web Usage mining, Association rule, Apriori algorithm, Frequent patterns I. Introduction Data mining may be defined as extracting knowledge from a large amount of data by the application of various techniques such as statistical methods, classification, clustering, association rules etc. [1]. Association rule mining is one of the important techniques used in data mining which is used to find associations among a group of frequently accessed items, with support exceeding a pre specified threshold. Support of an itemset may be defined as the percentage of transactions containing the itemset. A popular example of this technique is market basket analysis which analyses buying habits of customers by finding associations between items in their baskets [1] For example, if computer and speakers appear frequently together in a transaction database then it is a frequent itemset. Association rules are also used in other areas such as e-commerce, marketing and promotional campaigns, website optimization and personalization etc. It is evident from the literature that frequent pattern mining is a very expensive task as counting the occurrence of itemsets in database takes a large amount of computational time. As a result, many algorithms were proposed to improve its processing time. One such algorithm is Apriori which is a type of association rule mining algorithm which finds frequent itemsets using a bottom up approach [9]. But its main limitation is that it takes a large amount of time in scanning large datasets. Also, the numbers of candidate itemsets generated are huge. To remove these limitations, many improved variations of apriori have been proposed. The proposed improved Apriori algorithm called Extended Apriori removes the limitation of large scanning time by reducing the size of database which is done by removing unnecessary transactions. This rest of paper is divided into following sections: section 2.0 describes the related work, section 3.0 covers the existing and proposed algorithm, Section 4.0 uses proposes algorithm to mine the patterns from web server logs of an educational web site in order to discover frequently visited pages by a user and also shows graphically, the comparison results between the existing and proposed algorithm using three parameters: support count, transaction count and running time. Section 5.0 concludes the paper and outlines some of the promising areas of the future research. II. Related work Apriori algorithm used for frequent pattern mining was proposed by Aggarwal et al. [7]. Though many variations of this algorithm exist till date, but Apriori is still an area which needs further research. Many variations done on the Apriori algorithm is presented in this section: IJETCAS 13-134; 2012, IJETCAS All Rights Reserved Page 203

The AIS (Agrawal, Imielinski, Swami) algorithm [6] was the first algorithm to address the problem of generating association rules. It uses candidate itemsets in order to find frequent itemsets. It produces candidate itemsets on the fly on each scan of the database and compares them with already found itemsets to check if they are present in the previous dataset. The disadvantage of this algorithm was that it generates too many candidate itemsets and also requires a large number of database scans, hence increasing the processing time of the algorithm. Aggarwal et al. [7] also presented various versions of Apriori namely: Apriori, AprioriTid and AprioriHybrid. Apriori algorithm firstly finds those itemsets that are likely to be frequent. These are known as candidate itemsets. These are generated by joining the previously found frequent patterns. Apriori property is used to prune unneccesary patterns from candidate itemsets, thus reducing its size. Next, it finds those itemsets that are actually frequent by using a minimum support. AprioriTID uses the same candidate generating technique of Apriori with a difference that it does not uses the same database to search for the frequent itemsets. Rather it constructs a database by encoding the candidate itemsets which is smaller in size than the original database. Its main advantage is that it is faster than Apriori algorithm. A further improvement in this field was done with AprioriHybrid, which uses Apriori in the initial stages but shifts to AprioriTID in the later stages. A Direct Hashing and Pruning (DHP) algorithm proposed by [2] generates candidate itemsets using a hash based technique. It removes those entries from hash table which does not fulfill minimum support criteria, thus improving its efficiency. It also utilizes a pruning technique to reduce the database size. Echivalence Class Clustering and Bottom-up Lattice Traversal (Eclat) [5] uses a depth first search technique in contrast to Apriori which uses a bottom up mechanism. It partitions the candidates into equivalence classes which are disjoint to each other. The local database is scanned only once. FP- Growth was an effort to improve Apriori by introducing a data structure called frequent pattern tree. Han et al. [3] proposed an algorithm called FP-growth for frequent pattern mining. It uses a divide and conquer mechanism and does not generate candidate itemsets. Though many algorithms exist for association mining it still suffers from limitation of scanning the database multiple times. Therefore, it is needed to address the issues of reducing the time of database scan. This limitation and some other related problems made us to continue the research work in this particular area. In this paper an improved version of Apriori algorithm called Extended Apriori algorithm is proposed which reduces the database size so as to reduce the time of scanning the database. III. Existing and Proposed algorithm A. Existing algorithm: Apriori Apriori is an algorithm proposed by R. Agrawal et al. [7] in 1994 for mining frequent itemsets. Apriori uses an iterative bottom up approach, where k-itemsets are used tofind (k+1)-itemsets. The procedure is divided into following two steps: a. Join step: Set of frequent 1-itemsets is searched by scanning the database to determine the count for each itemset, and considering only those items that fulfill minimum support criteria. This set is represented as L1. Support of an itemset is the percentage of database transactions in which it appears. In general terms, Lk-1 is joined with itself in order to find candidate k-itemsets denoted as Ck, which is then used to find Lk in the prune step by using the minimum support criteria [4]. b. The prune step: Database is scanned to find the count of each candidate in Ck which will find Lk. Ck can be large, and so this could require heavy processing. To reduce its size, the Apriori property is used in which if any k-1 subset of a candidate k-itemset does not lie in Lk-1, then the candidate itemset can be removed from Ck. Only those itemsets are included in Lk that satisfy the minimum support count criteria [4]. 1. Improving the efficiency of Apriori Transaction reduction: It reduces the number of transactions in the database. One of the methods is to remove all those transactions that do not contain frequent k-itemsets while searching for frequent k+1 itemsets. Reducing the number of passes over data: Many techniques exist for achieving this. One of the methods is to use partitioning method [8]. A partitioning technique needs only two database scans to mine the frequent itemsets. Reducing the number of candidates generated. B. Proposed algorithm: Extended Apriori algorithm The traditional Apriori algorithm suffers from two major bottlenecks: large amount of time is required in scanning database and large number of candidate itemsets generation. Proposed algorithm addresses the first problem by IJETCAS 13-134; 2013, IJETCAS All Rights Reserved Page 204

reducing the time for scanning database which is done by reducing the number of database transactions by incorporating the following two features: (i) Before determining the count of candidate k-itemsets from the database, only those transactions are considered in which length of transaction is greater than or equal to k while those having length less than k are eliminated. (ii) A number field is added along with each database transaction. For each transaction it is tested whether a subset of candidate itemset is present in a transaction and for those transactions it is true, the number field is incremented by one. This process is repeated for each candidate itemset. After the count has been accumulated, those transactions are removed that have Number <(2 length of transaction length of -1)/2, where (2 transaction -1) denotes the total number of subsets of itemsets in a transaction. The left over transactions are taken as input for pruning stage. The algorithmic steps in the form of Pseudo code are shown below: 1. Algorithm: Input: D: a database containing transactions; min_supp: the minimum support threshold. Output: O: frequent itemsets in database, D. Method: Scan D to generate C1 and use it to find L1; For (k = 2;Lk-1 0;k++) do begin Ck = apriori gen(lk-1) //genearates candidates and uses Apriori property For each transaction t Є D do begin If(t.length<k) Delete t; For each candidate c Є Ck do begin For each transaction t Є D do begin // scan D for subset count Ct = subset( CK,t ); //get the subsets of t that are candidates t.number++ For each transactions t Є D do Begin If((t.number<2 t.length -1)/2) Delete t For each transaction t Є D do begin Ct = subset( CK,t ); //get the subsets of t that are candidates For each candidate c Є Ck do begin // scan D for subset count c.count++ Lk = { c Є Ck c.count min_supp }; 2. Description of Pseudo code [4]: 1. Frequent 1-itemsets, L1 are found. Each item is taken in candidate 1-itemsets, C1. The algorithm then scans all of the transactions to count the number of times each item occurs. The set of frequent 1-itemsets, L1 is then computed which consists of the candidate 1-itemsets satisfying minimum support count. 2. The apriori gen procedure [7] generates the candidates and then utilizes Apriori property to remove those itemsets that have an infrequent subset. 3. Then, it is checked if the length of transaction is less than the number of the current iteration. If it returns true, then that corresponding transaction is deleted because those transactions would serve no purpose during scanning. IJETCAS 13-134; 2013, IJETCAS All Rights Reserved Page 205

4. In the next step, database is checked to see how many candidate itemsets are contained in each transaction. If the accumulated count is less than half the number of subsets of each transaction, then that corresponding transaction is deleted. 5. Once the unnecessary transactions are deleted, a subset function is applied to each transaction, in order to search for all subsets of transaction which are candidates and the corresponding counts for each of the candidates are collected. 6. Finally, all those candidates that satisfy minimum support count, constitute the set of frequent itemsets, L. IV. Mining Web Server logs using Extended Apriori algorithm In this work, web log files of the month March 2013 is taken from the web site www.uietkuk.org. This is an engineering college web site and provides information of engineering and PG subjects, faculty details, syllabus, courses offered, etc. This knowledge mined from this web site can be used for web personalization, web site design modification, discovering frequently visited pages by the user etc. Both the data preprocessing steps namely, data cleaning, session identification and user identification and pattern discovery was done by using language JAVA SDK 6.0, Netbeans IDE and Windows XP operating system. Performance comparison with the help of graphs is shown in Figures. 1 and 2 which indicates that Extended Apriori algorithm is more efficient as it takes lesser time in both cases. The major advantage of this approach is that the algorithm results in reduced time in database scans as size of database is greatly reduced. To make the comparison fair, algorithms are implemented using the same data structures in java namely, hash map, array list and Properties, each having its distinct properties. Test data set of same is used having same number of transactions for generating the graph in figure 1whereas test data sets of different sizes are used having different number of transactions for generating the graph in figure 2. Y-axis of both graphs refers to the runnng tme in seconds and X-axis refers to the minimum support constraint in fig 1 and to the number of transactions in fig 2. Figure 1 shows that, as the support count increases, running time of the algorithm decreases. Figure 2 shows that, as number of transactions increases, running time also increases. Figure 1 Plot of Support count vs. Time taken IJETCAS 13-134; 2013, IJETCAS All Rights Reserved Page 206

Figure 2 Plot of No. of transactions vs. Time taken V. Conclusions and future research directions In this paper, the improved version of existing Apriori algorithm known as Extended Apriori algorithm is proposed to overcome the limitations of the basic Apriori algorithm. The new proposed method generates smaller database which reduces the time of database scans. The improved version Apriori algorithm is more efficient which takes less time and hence reflects high efficiency. Present work is on association rule mining but future research will be done using other data mining techniques such as classification, clustering etc. on web server logs to extract meaningful patterns. Also, the proposed method will be compared with the other existing algorithms based on time and space complexity and other tools such as MATLAB will be adopted for its simulation. References [1] G.K. Gupta, Introduction to data mining with case studies, Second edition, 2006. [2] J.S. Park, M.chen, P.S. Yu, An effective hash based algorithm for mining association rules,acm SIGMOD International Conference on Management of Data,May 1995 [3] Jiawei Han, Jian Pei, Yiwen Yin, Mining Frequent Patterns without Candidate Generation, Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 16-18, 2000. [4] Jiawei Hwan and Micheline Kamber, Data Mining: Concepts and techniques, Second edition,april 012. [5] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li, New Algorithmsfor Fast Discovery of Association Rules, The 3rd International Conference on Knowledge Discovery and Data Mining, 1997. [6] Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C., 1993. [7] Rakesh Agrawal, Ramakrishnan Srikant, Fast algorithms for mining association rules, Proceedings of 20th International Conference on Very Large Data Bases,Chile, September,1994. [8] S.Veeramalai, N.Jaisankar and A.Kanna, Efficient Web Log Mining Using Enhanced Apriori Algorithm with Hash Tree and Fuzzy, International journal of computer science & information Technology (IJCSIT), Vol.2, No.4, August 2010. [9] Suneetha K R and Krishnamoorti R, Web Log Mining using Improved Version of Apriori Algorithm, International Journal of Computer Applications, Volume 29 No.6, September 2011. IJETCAS 13-134; 2013, IJETCAS All Rights Reserved Page 207