An Efficient Method of Web Sequential Pattern Mining Based on Session Filter and Transaction Identification

Jingjun Zhu
Department of Computer Science and Technology, Tsinghua University

Haiyan Wu and Guozhu Gao
Department of Computer Science and Technology, Tsinghua University

Abstract
Web sequential pattern mining is an important way to analyze the access behavior of web users. In this paper, we present an efficient method of web sequential pattern mining based on session filter and transaction identification. Unlike traditional mining methods, we categorize the user sessions into human user sessions, crawler sessions and resource-download user sessions. We then filter out the non-human user sessions, leaving the human user sessions for sequential pattern mining. To mine users' meaningful sequential patterns, we identify users' transactions from the user sessions and mine the sequential patterns at the transaction level. We present a method of transaction identification based on the user's access path tree, which finds all the transactions effectively. We also make some improvements to the PrefixSpan algorithm, which reduce the memory it takes and avoid generating duplicate projections. Experiments on the access logs of two websites show that the improved algorithm is faster than GSP and PrefixSpan, most clearly when the transactions are longer.

Index Terms
Web Mining; Sequential Pattern; Session Filter; Transaction Identification; PrefixSpan Algorithm

I. INTRODUCTION
As the Internet develops rapidly, more and more people visit websites of all kinds to get the information they need, and their behavior in turn affects the websites. Web sequential pattern mining is an important way to understand the access behavior of web users. Web sequential patterns are the frequent access subsequences that satisfy a user-specified minimum support. They describe the most frequent sequential relationships among the web pages that people visit.
By using web sequential patterns, we can improve the topology of a website so that users can get more information with fewer operations. Web sequential patterns can also help to provide personalized service for users. Web sequential pattern mining therefore has good prospects and has become a hot research topic in data mining.

Web sequential pattern mining is based on the web access log. The web access log records the access information of users, including IP address, access time, request URL, referrer, user-agent and so on [2][13]. All the access sequential patterns can be found from the web access log, but the raw data in the log can't be mined directly. We should first preprocess the web access log, and then mine the sequential patterns with a mining algorithm.

Many researchers have worked on web sequential pattern mining. Since the 1990s, some useful mining techniques and algorithms have been developed. They work well in the traditional web environment, but in recent years the Internet has changed. One of the changes is that web users are increasingly diversified. Besides the normal users who visit a website with a browser (we call them "human users"), there are some non-human users. Since the year 2000, search engines have become more and more popular. They use crawlers to collect the information of web pages, so many crawlers access websites every day, and their behavior is different from that of human users.

Besides the crawlers, we find some users who do not visit any web pages but only download resources from the website. They request only resource files such as .mp3, .wmv, .rm, etc. Users of this type generally download the resources by clicking a resource URL provided by another website or by using a download tool. We call them resource-download users. As P2P download tools such as Xunlei (a.k.a. Thunder), BitTorrent and Flashget have come into wide use in recent years, the traffic of resource-download users is increasing, which brings a heavy workload to the web server.

The behavior of crawlers and resource-download users is different from that of human users. In general, web sequential pattern mining is aimed at human users. The sequential patterns of crawlers and resource-download users are meaningless, so mining them is redundant and reduces the efficiency and accuracy of mining. It is therefore better to filter out the crawler sessions and the resource-download user sessions before mining. However, our investigation shows that most web sequential pattern mining methods neither categorize the user sessions nor filter out the useless ones before mining.

Supported by National High-tech R&D Program (863 Program) under grant No. 2007AA010306

Besides, most web sequential pattern mining methods work at the session level: each sequence is the click-stream of a user session. But user sessions are sometimes too long, which reduces the efficiency of the mining algorithm. Furthermore, the click-stream only records the order of the web pages the user clicks. It can't describe the whole access path in a user session. For these reasons, it is necessary to divide a user session into sequences of smaller granularity. These smaller sequences are called transactions. Each transaction represents a meaningful activity of a web user, from which we can find an access path of that user.

In this paper, we present an efficient method of web sequential pattern mining based on session filter and transaction identification. The method also contains data preprocessing before sequential pattern mining, which includes data cleaning, user identification and session identification. After getting all the user sessions, we categorize them into human user sessions, crawler sessions and resource-download user sessions. Then we filter out the crawler sessions and resource-download user sessions, leaving the human user sessions for mining. Next we identify the transactions from the human user sessions and construct the transaction database. Finally we mine the sequential patterns in the transaction database with an algorithm improved from the PrefixSpan algorithm [7]. We call the improved algorithm the CIC-PrefixSpan algorithm.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 discusses data preprocessing. Section 4 describes the method of session filter. Section 5 presents a method of transaction identification based on the user's access path tree. Section 6 discusses sequential pattern mining with the CIC-PrefixSpan algorithm. Section 7 shows the experimental results of sequential pattern mining. Finally, Section 8 provides conclusions.

II. RELATED WORK
The sequential pattern mining problem was first introduced by R. Agrawal and R. Srikant [4]. Web sequential pattern mining is one of the application fields of sequential pattern mining. R. Cooley et al. [1] presented a method of data preprocessing, which includes data cleaning, user identification, session identification and path completion. M. Arlitt [9] proposed the definition of session. [10] identified crawlers from the web access log and showed how their behavior differs from that of human users.

There are several classic algorithms of web sequential pattern mining. R. Agrawal et al. presented the AprioriAll algorithm [4] and the GSP algorithm [5]. J. Han et al. presented the FreeSpan algorithm [6] and the PrefixSpan algorithm [7]. The former two algorithms are improvements of the Apriori algorithm [3], but they still have to generate a large number of candidate sequences and scan the sequence database many times during the mining process. The latter two algorithms are both based on database projection. They project the sequence database recursively into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent items.

III. DATA PREPROCESSING
The aim of data preprocessing is to extract user sessions from the raw web access log, as input for the session filter.

A. Data Cleaning
Data cleaning removes the log entries that have nothing to do with mining [14]. The measures we take are as follows:
1) Delete the log entries with URL suffixes such as gif, jpg, jpeg, js, css, etc. In general, these files are downloaded automatically when a user requests a web page, and the user does not explicitly request them.
2) Delete the log entries whose request methods are not GET or POST.
3) Delete the log entries whose status codes begin with 4 or 5. They are generally caused by a request error or a server error.

B. User Identification
User identification is to identify the web access logs of each unique user.
We can identify different users according to their IP addresses, but this is not very accurate, because many different users access websites through proxy servers and their IP addresses recorded in the web access log are then the same. So we need more information to identify different users. The user-agent in the web access log records the browser software and operating system of the web user, which is useful for user identification [12]. It is reasonable to assume that each different user-agent represents a different user. So if the user-agents of two users with the same IP address are different, we treat them as two different users. Only requests with the same IP address and the same user-agent are regarded as coming from the same user.

C. Session Identification
A user session is considered to be all of the page accesses that occur during a single visit to a website. Users are likely to visit a website many times over a long period of time, and the goal of session identification is to divide the page accesses of each user into individual sessions. We set a timeout to identify different sessions. If the time between two consecutive page requests exceeds a certain threshold, it is assumed that the user is starting a new session. In many cases, the time threshold is set to 600 seconds.

IV. SESSION FILTER
The first step of session filter is to categorize all the user sessions into three types: human user sessions, crawler sessions and resource-download user sessions. We make some categorizing rules according to the features of these three types of users in the web access log.

The user-agent of a crawler is different from that of other users. Table I shows the user-agents of several common crawlers. We can see that the user-agents of these crawlers all contain their names and some key words such as spider, bot, slurp, crawler, etc. These key words can't be found in the user-agents of other users. So if the user-agent of a user contains one or more of these key words, it is assumed that this user is a crawler, and its sessions are categorized as crawler sessions.

TABLE I. THE USER-AGENTS OF SEVERAL COMMON CRAWLERS
Crawler Name | User-Agent
Google crawler | Mozilla/5.0 (compatible; Googlebot/2.1; +
Baidu crawler | Baiduspider+(+
Yahoo crawler | Mozilla/5.0 (compatible; Yahoo! Slurp;
Sogou crawler | Sogou web spider/4.0(+
Youdao crawler | Mozilla/5.0 (compatible; YoudaoBot/1.0; )
Soso crawler | Sosospider+(+

TABLE II. THE STATISTICS OF THE COUNTS OF EACH TYPE OF USERS
(Cells marked … did not survive in this transcription.)
Website | Log ID | Sum of requests | Request proportion: human users | Request proportion: crawlers | Request proportion: resource-download users | Sum of sessions | Session proportion: human users | Session proportion: crawlers | Session proportion: resource-download users
THZY | … | … | … | 3.5% | 45.7% | … | 83.0% | 4.5% | 12.5%
THZY | … | … | … | 3.3% | 50.9% | … | 82.0% | 4.2% | 13.8%
THZY | … | … | … | 3.5% | 41.1% | … | 82.0% | 4.5% | 13.5%
WLXT | … | … | … | 0.3% | 6.6% | … | 73.7% | 1.8% | 24.5%
WLXT | … | … | … | 0.3% | 7.3% | … | 75.4% | 2.9% | 21.7%
WLXT | … | … | … | 0.6% | 9.4% | … | 63.9% | 1.9% | 34.2%

There are no such key words for resource-download users in the web access log, so we must identify them by their behavior, which is very distinctive: they only request resource downloads and never visit web pages. So if there are no web pages in the set of files requested by the user in a session, we regard the session as a resource-download user session. Web pages are the files with suffixes such as jsp, php, html, htm, asp, etc. Resource files usually have suffixes such as mp3, wmv, rmvb, rm, wma, rar, doc, etc.

After identifying the crawler sessions and the resource-download user sessions, we can filter them out. The remaining user sessions are considered human user sessions. We use the human user sessions for transaction identification and sequential pattern mining.

We use our categorizing rules to categorize the user sessions of THZY and WLXT, which are two websites of Tsinghua University.
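The session identification of Section III and the categorizing rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Request` record, the function names and the handling of URLs without a suffix (treated here as non-pages) are all hypothetical choices; only the key words, the page suffixes and the 600-second timeout come from the text.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

CRAWLER_KEYWORDS = ("spider", "bot", "slurp", "crawler")  # key words named in the text
PAGE_SUFFIXES = ("jsp", "php", "html", "htm", "asp")      # web page suffixes named in the text
SESSION_TIMEOUT = 600                                     # seconds, as in Section III.C


@dataclass
class Request:
    """Hypothetical record for one cleaned log entry."""
    ip: str
    user_agent: str
    timestamp: float  # seconds since an arbitrary epoch
    url: str
    referrer: str = ""


def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group requests by (IP, user-agent) and start a new session after each gap > timeout."""
    sessions = []
    current_by_user = {}  # (ip, user_agent) -> session currently being extended
    for r in sorted(requests, key=lambda r: r.timestamp):
        key = (r.ip, r.user_agent)
        current = current_by_user.get(key)
        if current is None or r.timestamp - current[-1].timestamp > timeout:
            current = []
            sessions.append(current)
            current_by_user[key] = current
        current.append(r)
    return sessions


def is_web_page(url):
    """Assumption: a URL counts as a web page only if its path ends in a listed suffix."""
    path = urlparse(url).path
    suffix = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return suffix in PAGE_SUFFIXES


def categorize(session):
    """Return 'crawler', 'resource-download' or 'human' for one session."""
    agent = session[0].user_agent.lower()
    if any(keyword in agent for keyword in CRAWLER_KEYWORDS):
        return "crawler"
    if not any(is_web_page(r.url) for r in session):
        return "resource-download"
    return "human"


def human_sessions(requests):
    """Session filter: keep only the sessions categorized as human."""
    return [s for s in split_sessions(requests) if categorize(s) == "human"]
```

The crawler test is applied first because it depends only on the user-agent, while the resource-download test needs the whole session. A real deployment would probably need a longer key word list and an explicit rule for extension-less URLs such as the homepage.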
We choose three one-day periods of THZY's and WLXT's access logs respectively for analysis. The statistics of the counts of each type of users are shown in Table II. We can see that, in both request and session counts, the proportion of human users is the largest, the proportion of resource-download users is the second largest, and the proportion of crawlers is the smallest. The proportion of resource-download users is not small: their request proportion in THZY is even close to that of human users. In contrast, the proportion of crawlers is very small, no more than 5%.

By analyzing the sessions of resource-download users, we find that their access behavior is very simple. Single-threaded download users request one or more resource files in a session, and each resource file is requested only once. Multi-threaded download users request the same resource file many times, and each request downloads a different part of the file. Usually, their referrers are not web pages of the website. So there is no access path for resource-download users.

We also analyze the access behavior of crawlers. In general, their sessions are very short. On average, each crawler session contains only 3 requests, and about 80% of crawler sessions contain only 1 or 2 requests. More than 90% of crawler requests have no referrer, so we can't construct their access path. They also have few similar access sequences.

All of this shows that mining the web sequential patterns of resource-download users and crawlers is meaningless. For THZY and WLXT, crawler sessions and resource-download user sessions together account for roughly 17% to 36% of all sessions in Table II. Filtering them out improves the mining efficiency.

V. TRANSACTION IDENTIFICATION
After session filter, only the human user sessions are left. But they are still too long. We find that some human user sessions in WLXT contain more than 30 requests. Mining these sessions is a time-consuming process.

The aim of transaction identification is to divide a user session into sequences that are shorter and more meaningful. There are several ways to define a transaction. We use the MFP (Maximal Forward Path) presented by M. S. Chen et al. in [11]. An MFP is the path in a user session from the first page of a forward visit up to the page before a backward visit is made. A user session may contain one or more MFPs, and each MFP is a meaningful sequence of pages. We define each MFP as a transaction, so transaction identification amounts to finding all the MFPs in a user session.

We can find the MFPs from a user's access path in a session. In general, the structure of a user's access path looks like a tree. Users usually enter a website from the homepage, and then click different links on the page to visit several new pages. Each new page forms a new branch in the user's access path. They continue to visit in each branch and may generate more branches. So we use a structure called the access path tree to describe a user's access path.

A user's access path tree is composed of nodes and directed edges. Nodes represent the web pages the user visits, and directed edges represent links between web pages. Every access path tree has a root node, representing the first page in a user session. Every non-leaf node has one or more child nodes, representing the branches in the access path. Figure 1 shows a typical access path tree. It represents the access sequence {A, B(A), C(A), D(A), E(B), F(E), G(C), H(G), I(C), J(D), K(J)}, where the page inside the brackets is the referrer of the page outside the brackets.

[Figure 1. A typical access path tree]

Constructing the access path tree means finding the parent node of each node according to its referrer. Suppose the access sequence in a user session is $\{a_1, a_2, a_3, \ldots, a_n\}$. We traverse the pages in the access sequence. For page $a_i$ ($1 \le i \le n$), we traverse the subsequence $\{a_1, a_2, \ldots, a_{i-1}\}$ from back to front.
If there exists a page $a_k$ ($1 \le k \le i-1$) whose URL is the same as the referrer of $a_i$, we take $a_k$ as the parent node of $a_i$. Otherwise we ignore $a_i$ and deal with the next page. The pseudo-code of constructing an access path tree is shown in Table III.

We can find all the MFPs from the user's access path tree. Each MFP is the path from the root node to one of the leaf nodes. For example, we can find 4 MFPs in the access path tree shown in Figure 1: {A, B, E, F}, {A, C, G, H}, {A, C, I} and {A, D, J, K}. So the original access sequence can be divided into 4 transactions. Each transaction describes a meaningful activity of the web user, and each transaction is much shorter than the session.

TABLE III. THE PSEUDO-CODE OF CONSTRUCTING AN ACCESS PATH TREE
Input: User's access sequence L
Output: The root node of access path tree
    for (i = 0; i < L.size(); i++)
        if (i == 0)
            root = L[i]
        else
            for (k = i - 1; k >= 0; k--)
                if (L[k]'s url == L[i]'s referrer)
                    L[i]'s parent node = L[k]
                    insert L[i] to the end of L[k]'s child node list
                    break
    return root

VI. SEQUENTIAL PATTERN MINING
After identifying all the transactions, we can construct the transaction database. The task of sequential pattern mining is to find the sequential patterns in the transaction database. These sequential patterns describe the characteristics of users' behavior.

A. Basic Definitions
First we introduce some basic definitions of sequential pattern mining. These definitions can be found in [17].

Definition 1: A sequence s, denoted by $\{s_1, s_2, \ldots, s_n\}$, is an ordered list of elements. $s_i$ ($1 \le i \le n$) is an element of the sequence. Each element is an itemset, which has one or more items. In this paper, we only discuss the case in which each element has exactly one item, so $s_i$ also denotes an item in the sequence.
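Before the remaining definitions, the transaction identification of Section V can be made concrete. The sketch below follows Table III to build the access path tree and then lists every root-to-leaf path as an MFP. It is an illustration only: the function name and the representation of a session as (url, referrer) pairs are assumptions, and pages whose referrer is not found are left unattached, as in Table III.

```python
def maximal_forward_paths(sequence):
    """sequence: list of (url, referrer) pairs in access order.

    Builds the access path tree as in Table III and returns every
    root-to-leaf path (one MFP, i.e. one transaction, per leaf).
    """
    if not sequence:
        return []

    # children[i] holds the indices of the child nodes of page i.
    children = [[] for _ in sequence]
    for i in range(1, len(sequence)):
        referrer = sequence[i][1]
        # Scan the earlier pages from back to front for the referrer.
        for k in range(i - 1, -1, -1):
            if sequence[k][0] == referrer:
                children[k].append(i)
                break
        # If no earlier page matches, page i is ignored (never attached).

    # Depth-first walk from the root, collecting one path per leaf.
    paths = []
    stack = [(0, [sequence[0][0]])]
    while stack:
        node, path = stack.pop()
        if not children[node]:
            paths.append(path)
        # Push children in reverse so the first child is explored first.
        for child in reversed(children[node]):
            stack.append((child, path + [sequence[child][0]]))
    return paths


# The access sequence of Figure 1; the root page has no referrer.
figure_1 = [("A", None), ("B", "A"), ("C", "A"), ("D", "A"), ("E", "B"),
            ("F", "E"), ("G", "C"), ("H", "G"), ("I", "C"), ("J", "D"), ("K", "J")]
for path in maximal_forward_paths(figure_1):
    print(path)
```

On the Figure 1 sequence this yields the four transactions given in the text: {A, B, E, F}, {A, C, G, H}, {A, C, I} and {A, D, J, K}.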
Definition 2: A sequence $\alpha = \{a_1, a_2, \ldots, a_n\}$ is called a subsequence of another sequence $\beta = \{b_1, b_2, \ldots, b_m\}$ if there exist integers $1 \le j_1 < j_2 < \cdots < j_n \le m$ such that $a_1 = b_{j_1}, a_2 = b_{j_2}, \ldots, a_n = b_{j_n}$. In this case, $\beta$ is called a supersequence of $\alpha$.

Definition 3: A sequence database S is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple <sid, s> is said to contain a sequence $\alpha$ if $\alpha$ is a subsequence of s. The support of a sequence $\alpha$ in a sequence database S is the number of tuples in the database containing $\alpha$. It is denoted as $support_S(\alpha)$. The support of $\alpha$ can also be expressed as the percentage of tuples in the database containing $\alpha$.

Definition 4: Given a positive integer min_support as the support threshold, a sequence $\alpha$ is called a sequential pattern in sequence database S if $support_S(\alpha) \ge$ min_support. We also call $\alpha$ a frequent sequence in the sequence database. If $\alpha$ has only one item, we call it a frequent item.

A web sequential pattern is the special case in which each item of a sequential pattern is a web page.

B. PrefixSpan Algorithm
Mining sequential patterns requires a mining algorithm. The AprioriAll algorithm and the GSP algorithm are both Apriori-like algorithms. They may generate a large number of candidate sequences during the mining process, and they need to scan the sequence database multiple times. So their mining efficiency is very low when the sequence database is very large or the sequential patterns are very long. In contrast, the FreeSpan algorithm and the PrefixSpan algorithm do not generate any candidate sequences. Furthermore, they use the database-projection technique to greatly reduce the scope of each scan. So their mining efficiency is much higher than that of AprioriAll and GSP. PrefixSpan is itself an improvement on FreeSpan, so in this paper we base the sequential pattern mining on the PrefixSpan algorithm. To increase the mining efficiency, we make some improvements to it.

In this part, we give a brief introduction to the PrefixSpan algorithm. We first introduce the concepts of prefix, suffix and projected database.

Definition 1 (Prefix, Suffix): Given a sequence $\alpha = \{a_1, a_2, \ldots, a_n\}$, a sequence $\beta = \{b_1, b_2, \ldots, b_m\}$ ($m \le n$) is called a prefix of $\alpha$ if and only if $b_i = a_i$ for all $i \le m$. We call the sequence $\gamma = \{a_{m+1}, a_{m+2}, \ldots, a_n\}$ the suffix of $\alpha$ with regard to prefix $\beta$.

Definition 2 (Projected database): Let $\alpha$ be a sequential pattern in a sequence database S. The $\alpha$-projected database, denoted as $S|_\alpha$, is the collection of suffixes of sequences in S with regard to prefix $\alpha$.

The PrefixSpan algorithm is based on the idea of database projection and sequential pattern growth. It is described in Table IV.

TABLE IV. THE DESCRIPTION OF PREFIXSPAN ALGORITHM
Input: A sequence database S and the minimum support threshold min_support
Output: The complete set of sequential patterns
Method: Call PrefixSpan({}, 0, S)
Subroutine PrefixSpan($\alpha$, l, $S|_\alpha$)
The parameters are: $\alpha$ is a sequential pattern; l is the length of $\alpha$; $S|_\alpha$ is the $\alpha$-projected database if $\alpha \ne \{\}$, otherwise it is the sequence database S.
The procedure of PrefixSpan($\alpha$, l, $S|_\alpha$) is as follows:
1) Scan $S|_\alpha$ once and find each frequent item b that satisfies min_support.
2) For each frequent item b, append it to $\alpha$ to form a new sequential pattern $\alpha'$, and output $\alpha'$.
3) For each $\alpha'$, construct the $\alpha'$-projected database $S|_{\alpha'}$ and call PrefixSpan($\alpha'$, l+1, $S|_{\alpha'}$), until no frequent item is found in the projected database.

C. Improvements on PrefixSpan Algorithm
Although the PrefixSpan algorithm is more efficient than the other algorithms, it can still be improved in some respects.

First, the PrefixSpan algorithm constructs a projected database for each sequential pattern. If there are a large number of sequential patterns, the total size of the projected databases becomes very large and consumes a lot of memory. To reduce the size of the projected databases, J. Pei et al. presented a technique called pseudoprojection [7]. The idea is as follows: instead of performing a physical projection by copying the whole suffix as a projected subsequence, one can record the identifier of the corresponding sequence and the starting position of the projected suffix in that sequence. This works because each projected subsequence is just a suffix of a sequence in the original database.

We can use a tuple <sequence_id, offset> to represent a projected subsequence. The sequence_id is the identifier of the corresponding sequence, and the offset is the starting position of the projection in the sequence (counted from 0). For example, for the sequence s = {a, b, c, d, e, f, g, h} with sequence_id 8, the tuple <8, 4> represents the projected subsequence {e, f, g, h}. We use these tuples to represent all the projected subsequences in a projected database and so construct a pseudoprojected database. A pseudoprojected database clearly takes much less space than the corresponding physically projected one.
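To make the combination of Table IV and pseudoprojection concrete, here is a minimal Python sketch for the single-item case discussed in this paper. It is not the authors' C++ implementation: the function names are hypothetical, min_support is taken as an absolute count (as in Definition 4), and the sequence_id is simply the index of the sequence in a list.

```python
from collections import Counter


def prefixspan(database, min_support):
    """Mine all sequential patterns of single items.

    database: list of sequences (each a list of items).
    min_support: absolute number of sequences a pattern must occur in.
    Returns a dict mapping each pattern (a tuple of items) to its support.
    """
    patterns = {}

    def mine(prefix, projected):
        # projected is a pseudoprojected database: a list of
        # (sequence_id, offset) tuples instead of copied suffixes.
        counts = Counter()
        for sid, offset in projected:
            seq = database[sid]
            # Count each item at most once per projected suffix.
            counts.update({seq[i] for i in range(offset, len(seq))})

        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            patterns[pattern] = support

            # Project on the first occurrence of item at or after offset.
            new_projected = []
            for sid, offset in projected:
                try:
                    pos = database[sid].index(item, offset)
                except ValueError:
                    continue  # this suffix does not contain the item
                new_projected.append((sid, pos + 1))
            mine(pattern, new_projected)

    mine((), [(sid, 0) for sid in range(len(database))])
    return patterns


db = [list("abcdef"), list("abcdeg"), list("abcdfh")]
result = prefixspan(db, min_support=3)
print(len(result), result[("a", "b", "c", "d")])
```

Each recursion level stores only one pair of integers per surviving sequence, which is the memory saving that pseudoprojection is meant to provide. With the three sequences above and min_support = 3, the frequent items are a, b, c and d, so every non-empty ordered subset of them is reported, including both {a, b, c, d} and {a, b, d}.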
Second, we find that the PrefixSpan algorithm may generate duplicate projections in the process of mining. For example, suppose {a, b, c, d, e, f}, {a, b, c, d, e, g} and {a, b, c, d, f, h} are three sequences in the sequence database. For the sequential pattern {a, b}, the {a, b}-projected database contains three sequences: {c, d, e, f}, {c, d, e, g} and {c, d, f, h}. Suppose min_support = 3; then c and d are both frequent items in the projected database. According to the PrefixSpan algorithm, we should append c and d to {a, b} to form the new sequential patterns {a, b, c} and {a, b, d}. But d appears behind c in every projected subsequence, so every d also appears in the {a, b, c}-projected database, and the {a, b, c, d}-projected database is the same as the {a, b, d}-projected database. In this case, the subsequent pattern-growth process of {a, b, c, d} is the same as that of {a, b, d}.

Is it necessary to mine both projected databases? Each sequential pattern with prefix {a, b, d} must be a subsequence of a sequential pattern with prefix {a, b, c, d}. So the sequential patterns with prefix {a, b, c, d} are more complete: they cover the sequential patterns with prefix {a, b, d}, and there is no need to mine the {a, b, d}-projected database.

Duplicate projections occur quite often, especially in web sequential pattern mining. To avoid mining duplicate projected databases, we filter the frequent items found in the projected database. After finding all the frequent items, we scan the projected database once more to record the Forward Frequent ItemSet (FFIS for short) of each frequent item. The FFIS of a frequent item a, denoted as FFIS(a), is a set of tuples <x, counts>, where x is a frequent item which appears before a in some sequences, and counts is the number of sequences in which x appears before a. So FFIS(a) can be expressed as follows:

$$FFIS(a) = \{\, \langle x, counts \rangle \mid x \text{ is a frequent item that appears before } a \text{ in some sequences, and } counts \text{ is the number of sequences in which } x \text{ appears before } a \,\}$$
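The FFIS defined above can be computed with one extra pass over the projected database. The following Python sketch is an illustration only: the function names are hypothetical, the projected database is passed as plain lists of items, and "x appears before a" is read as "x occurs before at least one occurrence of a in that sequence", which is an assumption about a detail the definition leaves open.

```python
from collections import Counter, defaultdict


def forward_frequent_itemsets(projected, min_support):
    """Return (support, ffis) for the frequent items of a projected database.

    support[a] is the number of sequences containing a.
    ffis[a][x] is the number of sequences in which frequent item x appears before a.
    """
    support = Counter()
    for seq in projected:
        support.update(set(seq))
    frequent = {item for item, n in support.items() if n >= min_support}

    ffis = defaultdict(Counter)
    for seq in projected:
        seen = set()     # frequent items met so far in this sequence
        counted = set()  # (x, a) pairs already counted for this sequence
        for a in seq:
            if a not in frequent:
                continue
            for x in seen:
                if x != a and (x, a) not in counted:
                    counted.add((x, a))
                    ffis[a][x] += 1
            seen.add(a)

    return ({a: support[a] for a in frequent},
            {a: dict(ffis[a]) for a in frequent})


def always_preceded(support, ffis):
    """Items that some other frequent item precedes in every sequence containing them."""
    return {a for a in support
            if any(n == support[a] for n in ffis[a].values())}


# The {a, b}-projected database from the example above, with min_support = 3.
projected_db = [list("cdef"), list("cdeg"), list("cdfh")]
support, ffis = forward_frequent_itemsets(projected_db, 3)
print(ffis["d"], always_preceded(support, ffis))
```

For the example database, FFIS(d) contains the single tuple <c, 3> and support(d) is also 3, which is exactly the situation in which d follows c in every projected subsequence.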

With the definition of FFIS, we can see that for a frequent item a, each tuple $\langle s, n \rangle \in FFIS(a)$ satisfies $n \le support(a)$. If there exists a tuple $\langle b, m \rangle \in FFIS(a)$ such that $m = support(a)$, then a appears behind b in each sequence containing a. In this case we do not append a to the current sequential pattern to form a new sequential pattern, so there is no need to construct the corresponding projected database. By applying this filter to each frequent item, we avoid generating duplicate projections. In most cases, this improvement increases the efficiency of the PrefixSpan algorithm, although recording the FFIS of each frequent item takes some extra time.

We call the PrefixSpan algorithm with the improvements above the CIC-PrefixSpan algorithm, and we use it to do the sequential pattern mining.

D. Finding Maximal Sequential Patterns
The output of the CIC-PrefixSpan algorithm is a set of sequential patterns. In general the number of these sequential patterns is too large, so it is necessary to select the useful and representative ones. Here we use the maximal sequential patterns to represent all the sequential patterns. A sequential pattern s is maximal if there exists no sequential pattern which is a proper supersequence of s. All the subsequences of a sequential pattern are also frequent, so a maximal sequential pattern can represent all of its sub-patterns, and it is more useful for the analysis of user behavior. Furthermore, by filtering out the non-maximal sequential patterns, we reduce the size of the result set.

The last step of sequential pattern mining is to find all the maximal sequential patterns. The method is shown in Table V.

TABLE V. THE METHOD OF FINDING MAXIMAL SEQUENTIAL PATTERNS
Suppose the set of sequential patterns is S, and the maximum length of sequential patterns is n.
    for (k = n; k > 1; k--)
        for (each length-k sequential pattern $s_k$ remaining in S)
            delete all the proper subsequences of $s_k$ from S
The sequential patterns remaining in S are the maximal sequential patterns.

VII. EXPERIMENTAL RESULTS
In this section, we validate the efficiency of the CIC-PrefixSpan algorithm through experiments. By session filter, we have filtered out the non-human user sessions, leaving only the human user sessions for sequential pattern mining. We choose a one-day period of THZY's and WLXT's human user sessions respectively as the data sets of the experiments.

First, we identify the transactions from the human user sessions and construct the transaction database for each of the two websites. The numbers and the average lengths of sessions and transactions are shown in Table VI.

TABLE VI. THE NUMBERS AND THE AVERAGE LENGTHS OF SESSIONS AND TRANSACTIONS
(Cells marked … did not survive in this transcription; the value 1.83 is taken from the discussion of Figure 2.)
Website | The number of sessions | The average length of sessions | The number of transactions | The average length of transactions
THZY | … | … | … | 1.83
WLXT | … | … | … | …

Then we study the performance of the CIC-PrefixSpan algorithm in comparison with the GSP algorithm and the PrefixSpan algorithm. We use these three algorithms to mine the sequential patterns in the above two transaction databases. The experiments run on a machine with an Intel(R) Xeon(TM) 3.00GHz CPU and 2GB of main memory, running Red Hat Enterprise Linux AS release 4. The three algorithms are implemented in C++, and all the programs are compiled with g++.

[Figure 2. The runtime of the three algorithms in mining THZY's transaction database (min_support = 0.02% to 1%). Horizontal axis: support threshold (in %); vertical axis: runtime (in seconds); series: GSP, PrefixSpan, CIC-PrefixSpan.]

Figure 2 shows the runtime of the three algorithms at different support thresholds in mining THZY's transaction database. The support threshold ranges from 0.02% to 1%. We can see that the GSP algorithm is the slowest, and the CIC-PrefixSpan algorithm is slightly faster than the PrefixSpan algorithm. When min_support > 0.1%, the runtimes of the three algorithms are very close to each other. When min_support < 0.1%, the runtime of the GSP algorithm begins to increase rapidly, while the runtimes of the other two algorithms increase more steadily, and the gap between PrefixSpan and CIC-PrefixSpan remains very small. From Table VI we can see that the average length of THZY's transactions is very short (only 1.83). In this case, the efficiency of CIC-PrefixSpan is very close to that of PrefixSpan, so the benefit of CIC-PrefixSpan is small here, although it still improves the mining efficiency a little.

Figure 3. The runtime of the three algorithms in mining WLXT's transaction database (min_support = 0.1%~10%; x-axis: support threshold in %; y-axis: runtime in seconds; series: GSP, PrefixSpan, CIC-PrefixSpan)

Figure 3 shows the runtime of the three algorithms at different support thresholds in mining WLXT's transaction database. The support threshold ranges from 0.1% to 10%. The GSP algorithm is still the slowest, and as min_support decreases its runtime increases faster and faster. In contrast, the runtimes of the other two algorithms increase much more steadily. CIC-PrefixSpan is more efficient than PrefixSpan, and the gap between them is larger than in mining THZY's transaction database. In most cases, the runtime of CIC-PrefixSpan is only about half of that of PrefixSpan. Table VI shows that the average length of WLXT's transactions is much longer than that of THZY's transactions. So when the sequences are longer, CIC-PrefixSpan improves the mining efficiency noticeably.

Furthermore, to validate the accuracy of the mining result, we also record the numbers of maximal sequential patterns mined by the three algorithms at different support thresholds, as shown in Table VII and Table VIII. At each support threshold, the number of maximal sequential patterns mined by CIC-PrefixSpan is very close to the numbers mined by PrefixSpan and GSP. This indicates that CIC-PrefixSpan can find almost all the maximal sequential patterns.

TABLE VII. THE NUMBERS OF MAXIMAL SEQUENTIAL PATTERNS AT DIFFERENT SUPPORT THRESHOLDS IN MINING THZY'S TRANSACTION DATABASE
Columns: min_support | GSP | PrefixSpan | CIC-PrefixSpan (first row: min_support = 1%)

TABLE VIII. THE NUMBERS OF MAXIMAL SEQUENTIAL PATTERNS AT DIFFERENT SUPPORT THRESHOLDS IN MINING WLXT'S TRANSACTION DATABASE
Columns: min_support | GSP | PrefixSpan | CIC-PrefixSpan (first row: min_support = 10%)

In summary, our experiments show that CIC-PrefixSpan has the best performance among the three algorithms tested. It significantly improves the mining efficiency when the sequences are longer and the support thresholds are lower, and it also effectively finds the complete set of maximal sequential patterns.

VIII. CONCLUSIONS

In this paper, we present an efficient method of web sequential pattern mining. The method categorizes user sessions into human user sessions, crawler sessions and resource-download user sessions according to their features. It then filters out the non-human user sessions, leaving only the human user sessions for mining. To find users' meaningful sequential patterns, the method divides the human user sessions into transactions and mines the sequential patterns in the transaction database. We use the concept of MFP to define a transaction, and we present a method of identifying transactions based on the user's access path tree. To improve the mining efficiency, we make some improvements to the PrefixSpan algorithm. The improved algorithm, called CIC-PrefixSpan, works well in the experiments. With this mining method, all the useful web sequential patterns can be found efficiently.

With the development of the Internet, the composition of user sessions is becoming more and more complicated. The idea of categorizing user sessions is very important for the analysis of user behavior, because it helps to improve the accuracy of the analysis. Besides the three types of users mentioned in this paper, there may be other types of users that we have not considered. For non-human users, the characteristics of their behavior can also be analyzed. These issues are left for future study.

REFERENCES

[1] R. Cooley, B. Mobasher, J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns[J]. Journal of Knowledge and Information System, 1999: 5~31.
[2] R. Cooley, B. Mobasher, J. Srivastava. Web Mining: Information and pattern discovery on the World Wide Web. In International Conference on Tools with Artificial Intelligence, 1997: 558~567.
[3] R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, Santiago, Chile, 1994: 487~499.
[4] R. Agrawal, R. Srikant. Mining Sequential Patterns[C]. Proc 1995 Int Conf Data Engineering (ICDE 95). Taipei: IEEE Computer Society, 1995: 3~14.

[5] R. Agrawal, R. Srikant. Mining Sequential Patterns: Generalizations and Performance Improvements[C]. Proc 5th Int Conf Extending Database Technology (EDBT). Avignon: Lecture Notes in Computer Science, 1996: 3~17.
[6] J. Han, J. Pei. FreeSpan: Frequent Pattern-projected Sequential Pattern Mining[C]. Proc 2000 Int Conf Knowledge Discovery and Data Mining (KDD 00). Boston: ACM Press, 2000: 355~359.
[7] J. Pei, J. Han, et al. Mining Sequential Patterns by Pattern-growth: The PrefixSpan Approach[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 6(10): 1~17.
[8] J. Pei, J. Han, B. Mortazavi-asl, H. Zhu. Mining access patterns efficiently from web logs. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000: 396~407.
[9] M. Arlitt. Characterizing Web User Sessions. SIGMETRICS Performance Evaluation Review, 2000, 28(2): 50~63.
[10] M. Dikaiakos, A. Stassopoulou, L. Papageorgiou. Characterizing Crawler Behavior from Web Server Access Logs[C]. Proceedings of the 4th International Conference on E-Commerce and Web Technologies (EC-Web 03). Springer, 2003: 369~378.
[11] M. S. Chen, J. S. Park, P. S. Yu. Data Mining for Path Traversal Patterns in a Web Environment[C]. In Proceedings of the 16th International Conference on Distributed Computing Systems, 1996: 385~392.
[12] J. Srivastava, R. Cooley, M. Deshpande, et al. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data[J]. SIGKDD Explorations, ACM SIGKDD, 2000, 1(2): 12~23.
[13] M. Arlitt, C. Williamson. Web server workload characterization: The search for invariants. In ACM SIGMETRICS Conference, 1996: 126~137.
[14] O. R. Zaiane, M. Xin, J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In: Advances in Digital Libraries Conference, Santa Barbara, 1998: 19~29.
[15] B. Mobasher, N. Jain, E. Han, J. Srivastava. Web Mining: Pattern discovery from World Wide Web transactions. Technical Report TR, Department of Computer Science, University of Minnesota, Minneapolis.
[16] F. M. Facca, P. L. Lanzi. Mining interesting knowledge from weblogs: a survey. Data & Knowledge Engineering, 2005, 53(3): 225~241.
[17] J. Han, M. Kamber. Data Mining Concepts and Techniques, Morgan Kaufmann.

Jingjun Zhu was born in Jiangmen, Guangdong Province, China. He is a graduate student pursuing a Master's degree at Tsinghua University. His research interests include web mining, web performance evaluation, etc.

Haiyan Wu was born in Daqing, Heilongjiang Province, China. She is a senior engineer at Tsinghua University and holds a PhD degree. Her research interests include network information security, educational informationization, etc.

Guozhu Gao was born in Pingyao, Shanxi Province, China. He holds a Master's degree and works at Tsinghua University. His research interests include educational informationization, database management technology, etc.

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA SEQUENTIAL PATTERN MINING FROM WEB LOG DATA Rajashree Shettar 1 1 Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in Abstract

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS by Ramin Afshar B.Sc., University of Alberta, Alberta, 2000 THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

More information

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS INFORMATION SYSTEMS IN MANAGEMENT Information Systems in Management (2017) Vol. 6 (3) 213 222 USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS PIOTR OŻDŻYŃSKI, DANUTA ZAKRZEWSKA Institute of Information

More information

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment Ching-Huang Yun and Ming-Syan Chen Department of Electrical Engineering National Taiwan

More information

UMCS. Annales UMCS Informatica AI 7 (2007) Data mining techniques for portal participants profiling. Danuta Zakrzewska *, Justyna Kapka

UMCS. Annales UMCS Informatica AI 7 (2007) Data mining techniques for portal participants profiling. Danuta Zakrzewska *, Justyna Kapka Annales Informatica AI 7 (2007) 153-161 Annales Informatica Lublin-Polonia Sectio AI http://www.annales.umcs.lublin.pl/ Data mining techniques for portal participants profiling Danuta Zakrzewska *, Justyna

More information

PROBLEM STATEMENTS. Dr. Suresh Jain 2 2 Department of Computer Engineering, Institute of

PROBLEM STATEMENTS. Dr. Suresh Jain 2 2 Department of Computer Engineering, Institute of Efficient Web Log Mining using Doubly Linked Tree Ratnesh Kumar Jain 1, Dr. R. S. Kasana 1 1 Department of Computer Science & Applications, Dr. H. S. Gour, University, Sagar, MP (India) jratnesh@rediffmail.com,

More information

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database Algorithm Based on Decomposition of the Transaction Database 1 School of Management Science and Engineering, Shandong Normal University,Jinan, 250014,China E-mail:459132653@qq.com Fei Wei 2 School of Management

More information

Fuzzy Cognitive Maps application for Webmining

Fuzzy Cognitive Maps application for Webmining Fuzzy Cognitive Maps application for Webmining Andreas Kakolyris Dept. Computer Science, University of Ioannina Greece, csst9942@otenet.gr George Stylios Dept. of Communications, Informatics and Management,

More information

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42 Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Finding Generalized Path Patterns for Web Log Data Mining

Finding Generalized Path Patterns for Web Log Data Mining Finding Generalized Path Patterns for Web Log Data Mining Alex Nanopoulos and Yannis Manolopoulos Data Engineering Lab, Department of Informatics, Aristotle University 54006 Thessaloniki, Greece {alex,manolopo}@delab.csd.auth.gr

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Web Usage Mining for Comparing User Access Behaviour using Sequential Pattern

Web Usage Mining for Comparing User Access Behaviour using Sequential Pattern Web Usage Mining for Comparing User Access Behaviour using Sequential Pattern Amit Dipchandji Kasliwal #, Dr. Girish S. Katkar * # Malegaon, Nashik, Maharashtra, India * Dept. of Computer Science, Arts,

More information

A Novel Method of Optimizing Website Structure

A Novel Method of Optimizing Website Structure A Novel Method of Optimizing Website Structure Mingjun Li 1, Mingxin Zhang 2, Jinlong Zheng 2 1 School of Computer and Information Engineering, Harbin University of Commerce, Harbin, 150028, China 2 School

More information

Improving Efficiency of Apriori Algorithms for Sequential Pattern Mining

Improving Efficiency of Apriori Algorithms for Sequential Pattern Mining Bonfring International Journal of Data Mining, Vol. 4, No. 1, March 214 1 Improving Efficiency of Apriori Algorithms for Sequential Pattern Mining Alpa Reshamwala and Dr. Sunita Mahajan Abstract--- Computer

More information

Web Service Usage Mining: Mining For Executable Sequences

Web Service Usage Mining: Mining For Executable Sequences 7th WSEAS International Conference on APPLIED COMPUTER SCIENCE, Venice, Italy, November 21-23, 2007 266 Web Service Usage Mining: Mining For Executable Sequences MOHSEN JAFARI ASBAGH, HASSAN ABOLHASSANI

More information

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data Qiankun Zhao Nanyang Technological University, Singapore and Sourav S. Bhowmick Nanyang Technological University,

More information

Applying Data Mining to Wireless Networks

Applying Data Mining to Wireless Networks Applying Data Mining to Wireless Networks CHENG-MING HUANG 1, TZUNG-PEI HONG 2 and SHI-JINN HORNG 3,4 1 Department of Electrical Engineering National Taiwan University of Science and Technology, Taipei,

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

Web Usage Data for Web Access Control (WUDWAC)

Web Usage Data for Web Access Control (WUDWAC) Web Usage Data for Web Access Control (WUDWAC) Dr. Selma Elsheikh* Abstract The development and the widespread use of the World Wide Web have made electronic data storage and data distribution possible

More information

Effectively Capturing User Navigation Paths in the Web Using Web Server Logs

Effectively Capturing User Navigation Paths in the Web Using Web Server Logs Effectively Capturing User Navigation Paths in the Web Using Web Server Logs Amithalal Caldera and Yogesh Deshpande School of Computing and Information Technology, College of Science Technology and Engineering,

More information

Association Rule Mining. Introduction 46. Study core 46

Association Rule Mining. Introduction 46. Study core 46 Learning Unit 7 Association Rule Mining Introduction 46 Study core 46 1 Association Rule Mining: Motivation and Main Concepts 46 2 Apriori Algorithm 47 3 FP-Growth Algorithm 47 4 Assignment Bundle: Frequent

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Discovering Paths Traversed by Visitors in Web Server Access Logs

Discovering Paths Traversed by Visitors in Web Server Access Logs Discovering Paths Traversed by Visitors in Web Server Access Logs Alper Tugay Mızrak Department of Computer Engineering Bilkent University 06533 Ankara, TURKEY E-mail: mizrak@cs.bilkent.edu.tr Abstract

More information

Sequential PAttern Mining using A Bitmap Representation

Sequential PAttern Mining using A Bitmap Representation Sequential PAttern Mining using A Bitmap Representation Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu Dept. of Computer Science Cornell University ABSTRACT We introduce a new algorithm for mining

More information

An Effective Process for Finding Frequent Sequential Traversal Patterns on Varying Weight Range

An Effective Process for Finding Frequent Sequential Traversal Patterns on Varying Weight Range 13 IJCSNS International Journal of Computer Science and Network Security, VOL.16 No.1, January 216 An Effective Process for Finding Frequent Sequential Traversal Patterns on Varying Weight Range Abhilasha

More information

Characterizing Home Pages 1

Characterizing Home Pages 1 Characterizing Home Pages 1 Xubin He and Qing Yang Dept. of Electrical and Computer Engineering University of Rhode Island Kingston, RI 881, USA Abstract Home pages are very important for any successful

More information

Monotone Constraints in Frequent Tree Mining

Monotone Constraints in Frequent Tree Mining Monotone Constraints in Frequent Tree Mining Jeroen De Knijf Ad Feelders Abstract Recent studies show that using constraints that can be pushed into the mining process, substantially improves the performance

More information

A Comprehensive Survey on Sequential Pattern Mining

A Comprehensive Survey on Sequential Pattern Mining A Comprehensive Survey on Sequential Pattern Mining Irfan Khan 1 Department of computer Application, S.A.T.I. Vidisha, (M.P.), India Anoop Jain 2 Department of computer Application, S.A.T.I. Vidisha, (M.P.),

More information

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications Daniel Mican, Nicolae Tomai Babes-Bolyai University, Dept. of Business Information Systems, Str. Theodor

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

Mining Quantitative Association Rules on Overlapped Intervals

Mining Quantitative Association Rules on Overlapped Intervals Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,

More information

Sequential Pattern Mining Methods: A Snap Shot

Sequential Pattern Mining Methods: A Snap Shot IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-661, p- ISSN: 2278-8727Volume 1, Issue 4 (Mar. - Apr. 213), PP 12-2 Sequential Pattern Mining Methods: A Snap Shot Niti Desai 1, Amit Ganatra

More information

Data Mining Part 3. Associations Rules

Data Mining Part 3. Associations Rules Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets

More information

Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining

Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining Mona S. Kamat 1, J. W. Bakal 2 & Madhu Nashipudi 3 1,3 Information Technology Department, Pillai Institute Of Information

More information

Association Rule Mining among web pages for Discovering Usage Patterns in Web Log Data L.Mohan 1

Association Rule Mining among web pages for Discovering Usage Patterns in Web Log Data L.Mohan 1 Volume 4, No. 5, May 2013 (Special Issue) International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info Association Rule Mining among web pages for Discovering

More information

Log Information Mining Using Association Rules Technique: A Case Study Of Utusan Education Portal

Log Information Mining Using Association Rules Technique: A Case Study Of Utusan Education Portal Log Information Mining Using Association Rules Technique: A Case Study Of Utusan Education Portal Mohd Helmy Ab Wahab 1, Azizul Azhar Ramli 2, Nureize Arbaiy 3, Zurinah Suradi 4 1 Faculty of Electrical

More information

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management

More information

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management

More information

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

SeqIndex: Indexing Sequences by Sequential Pattern Analysis SeqIndex: Indexing Sequences by Sequential Pattern Analysis Hong Cheng Xifeng Yan Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign {hcheng3, xyan, hanj}@cs.uiuc.edu

More information

Behaviour Recovery and Complicated Pattern Definition in Web Usage Mining

Behaviour Recovery and Complicated Pattern Definition in Web Usage Mining Behaviour Recovery and Complicated Pattern Definition in Web Usage Mining Long Wang and Christoph Meinel Computer Department, Trier University, 54286 Trier, Germany {wang, meinel@}ti.uni-trier.de Abstract.

More information

I. Introduction II. Keywords- Pre-processing, Cleaning, Null Values, Webmining, logs

I. Introduction II. Keywords- Pre-processing, Cleaning, Null Values, Webmining, logs ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: An Enhanced Pre-Processing Research Framework for Web Log Data

More information

Pattern Classification based on Web Usage Mining using Neural Network Technique

Pattern Classification based on Web Usage Mining using Neural Network Technique International Journal of Computer Applications (975 8887) Pattern Classification based on Web Usage Mining using Neural Network Technique Er. Romil V Patel PIET, VADODARA Dheeraj Kumar Singh, PIET, VADODARA

More information

The influence of caching on web usage mining

The influence of caching on web usage mining The influence of caching on web usage mining J. Huysmans 1, B. Baesens 1,2 & J. Vanthienen 1 1 Department of Applied Economic Sciences, K.U.Leuven, Belgium 2 School of Management, University of Southampton,

More information

An Algorithm for Mining Large Sequences in Databases

An Algorithm for Mining Large Sequences in Databases 149 An Algorithm for Mining Large Sequences in Databases Bharat Bhasker, Indian Institute of Management, Lucknow, India, bhasker@iiml.ac.in ABSTRACT Frequent sequence mining is a fundamental and essential

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

Mining High Average-Utility Itemsets

Mining High Average-Utility Itemsets Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering

More information

Generating Cross level Rules: An automated approach

Generating Cross level Rules: An automated approach Generating Cross level Rules: An automated approach Ashok 1, Sonika Dhingra 1 1HOD, Dept of Software Engg.,Bhiwani Institute of Technology, Bhiwani, India 1M.Tech Student, Dept of Software Engg.,Bhiwani

More information

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach ABSTRACT G.Ravi Kumar 1 Dr.G.A. Ramachandra 2 G.Sunitha 3 1. Research Scholar, Department of Computer Science &Technology,

More information

Parallel Implementation of Apriori Algorithm Based on MapReduce

Parallel Implementation of Apriori Algorithm Based on MapReduce International Journal of Networked and Distributed Computing, Vol. 1, No. 2 (April 2013), 89-96 Parallel Implementation of Apriori Algorithm Based on MapReduce Ning Li * The Key Laboratory of Intelligent

More information

The Fuzzy Search for Association Rules with Interestingness Measure

The Fuzzy Search for Association Rules with Interestingness Measure The Fuzzy Search for Association Rules with Interestingness Measure Phaichayon Kongchai, Nittaya Kerdprasop, and Kittisak Kerdprasop Abstract Association rule are important to retailers as a source of

More information

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web Web Usage Mining Overview Session 1 This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web 1 Outline 1. Introduction 2. Preprocessing 3. Analysis 2 Example

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm International Journal of Scientific & Engineering Research Volume 4, Issue3, arch-2013 1 Improving the Efficiency of Web Usage ining Using K-Apriori and FP-Growth Algorithm rs.r.kousalya, s.k.suguna, Dr.V.

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

PSEUDO PROJECTION BASED APPROACH TO DISCOVERTIME INTERVAL SEQUENTIAL PATTERN

PSEUDO PROJECTION BASED APPROACH TO DISCOVERTIME INTERVAL SEQUENTIAL PATTERN PSEUDO PROJECTION BASED APPROACH TO DISCOVERTIME INTERVAL SEQUENTIAL PATTERN Dvijesh Bhatt Department of Information Technology, Institute of Technology, Nirma University Gujarat,( India) ABSTRACT Data

More information

Effective Mining Sequential Pattern by Last Position Induction

Effective Mining Sequential Pattern by Last Position Induction Effective Mining Sequential Pattern by Last Position Induction Zhenglu Yang and Masaru Kitsuregawa The University of Tokyo Institute of Industrial Science 4-6-1 Komaba, Meguro-Ku Tokyo 153-8305, Japan

More information

EFFECTIVELY USER PATTERN DISCOVER AND CLASSIFICATION FROM WEB LOG DATABASE

EFFECTIVELY USER PATTERN DISCOVER AND CLASSIFICATION FROM WEB LOG DATABASE EFFECTIVELY USER PATTERN DISCOVER AND CLASSIFICATION FROM WEB LOG DATABASE K. Abirami 1 and P. Mayilvaganan 2 1 School of Computing Sciences Vels University, Chennai, India 2 Department of MCA, School

More information

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management

More information

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES Prof. Ambarish S. Durani 1 and Mrs. Rashmi B. Sune 2 1 Assistant Professor, Datta Meghe Institute of Engineering,

More information

Data Mining of Web Access Logs Using Classification Techniques

Data Mining of Web Access Logs Using Classification Techniques Data Mining of Web Logs Using Classification Techniques Md. Azam 1, Asst. Prof. Md. Tabrez Nafis 2 1 M.Tech Scholar, Department of Computer Science & Engineering, Al-Falah School of Engineering & Technology,

More information

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal
