An Efficient Method of Web Sequential Pattern Mining Based on Session Filter and Transaction Identification


Jingjun Zhu
Department of Computer Science and Technology, Tsinghua University

Haiyan Wu and Guozhu Gao
Department of Computer Science and Technology, Tsinghua University

Abstract

Web sequential pattern mining is an important way to analyze the access behavior of web users. In this paper, we present an efficient method of web sequential pattern mining based on session filter and transaction identification. Unlike traditional mining methods, we categorize user sessions into human user sessions, crawler sessions and resource-download user sessions. We then filter out the non-human user sessions, leaving only the human user sessions for sequential pattern mining. To mine users' meaningful sequential patterns, we identify users' transactions within the user sessions and perform sequential pattern mining at the transaction level. We present a method of transaction identification based on the users' access path tree, which finds all the transactions effectively. We also make some improvements to the PrefixSpan algorithm that reduce the memory it uses and avoid generating duplicate projections. In our experiments, the improved algorithm is faster than both GSP and PrefixSpan, most clearly on the data set with longer transactions.

Index Terms: Web Mining; Sequential Pattern; Session Filter; Transaction Identification; PrefixSpan Algorithm

I. INTRODUCTION

As the Internet develops rapidly, more and more people visit websites of all kinds to get the information they need, and their behavior affects those websites. Web sequential pattern mining is an important way to understand the access behavior of web users. Web sequential patterns are the frequent access subsequences that satisfy a user-specified minimum support. They describe the most frequent sequential relationships among the web pages that people visit.
By using web sequential patterns, we can improve the topology of a website so that users can get more information with fewer operations. Web sequential patterns can also help to provide personalized services to users. Web sequential pattern mining therefore has good prospects and has become a hot research topic in data mining.

Web sequential pattern mining is based on the web access log, which records the access information of users, including IP address, access time, request URL, referrer, user-agent and so on [2][13]. All the access sequential patterns can be found from the web access log, but its raw data can't be mined directly. The log must be preprocessed first and then mined with a sequential pattern mining algorithm.

Web sequential pattern mining has been studied extensively. Since the 1990s, a number of useful mining techniques and algorithms have been developed. They work well in the traditional web environment. In recent years, however, there have been some changes on the Internet. One of them is that web users are increasingly diversified. Besides the normal users who visit a website with a browser (we call them human users), there are some non-human users. Since 2000, search engines have become more and more popular. They use crawlers to collect information from web pages, so many crawlers access websites every day, and their behavior is different from that of human users. Besides the crawlers, we find some users who do not visit any web pages but only download resources from the website. They request only resource files such as .mp3, .wmv, .rm, etc. Users of this type generally download the resources by clicking a resource URL provided by another website or by using a download tool. We call them resource-download users. As P2P download tools such as Xunlei (a.k.a. Thunder), BitTorrent and Flashget have become widely used in recent years, the traffic of resource-download users is increasing, which puts a heavy workload on the web server.

The behavior of crawlers and resource-download users is different from that of human users. In general, web sequential pattern mining targets human users. The sequential patterns of crawlers and resource-download users are meaningless; mining them is redundant and reduces the efficiency and accuracy of mining. It is therefore better to filter out the crawler sessions and the resource-download user sessions before mining. However, we find that most web sequential pattern mining methods neither categorize the user sessions nor filter out the useless user sessions before mining.

Supported by National High-tech R&D Program (863 Program) under grant No. 2007AA010306

Besides, most web sequential pattern mining methods work at the session level: each sequence is the click-stream of one user session. But user sessions are sometimes very long, which reduces the efficiency of the mining algorithm. Furthermore, the click-stream only records the sequence of web pages that the user clicks. It can't describe the whole access path in a user session. For these reasons, it is necessary to divide a user session into sequences of smaller granularity. These smaller sequences are called transactions. Each transaction represents a meaningful activity of a web user, from which we can find an access path of that user.

In this paper, we present an efficient method of web sequential pattern mining based on session filter and transaction identification. The method also includes data preprocessing before sequential pattern mining, which consists of data cleaning, user identification and session identification. After obtaining all the user sessions, we categorize them into human user sessions, crawler sessions and resource-download user sessions. We then filter out the crawler sessions and resource-download user sessions, leaving the human user sessions for mining. Next we identify the transactions in the human user sessions and construct the transaction database. Finally we mine the sequential patterns in the transaction database with an algorithm improved from the PrefixSpan algorithm [7]. We call the improved algorithm the CIC-PrefixSpan algorithm.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 discusses data preprocessing. Section 4 describes the method of session filter. Section 5 presents a method of transaction identification based on the users' access path tree. Section 6 discusses sequential pattern mining with the CIC-PrefixSpan algorithm. Section 7 shows the experimental results of sequential pattern mining. Finally, Section 8 provides conclusions.

II. RELATED WORK

The sequential pattern mining problem was first introduced by R. Agrawal and R. Srikant [4]. Web sequential pattern mining is one of the application fields of sequential pattern mining. [1] presented a method of data preprocessing, which includes data cleaning, user identification, session identification and path completion. M. Arlitt [9] proposed the definition of a session. [10] identified crawlers from the web access log and showed how their behavior differs from that of human users.

There are several classic algorithms for web sequential pattern mining. R. Agrawal et al. presented the AprioriAll algorithm [4] and the GSP algorithm [5]. J. Han et al. presented the FreeSpan algorithm [6] and the PrefixSpan algorithm [7]. The former two algorithms are improvements of the Apriori algorithm [3], but they still have to generate a large number of candidate sequences and scan the sequence database many times during the mining process. The latter two algorithms are both based on database projection. They project the sequence database recursively into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent items.

III. DATA PREPROCESSING

The aim of data preprocessing is to extract user sessions from the raw web access log, in preparation for the session filter.

A. Data Cleaning

Data cleaning removes the data that are irrelevant to mining [14]. The measures we take are as follows:

1) Delete the log entries whose URL suffixes are gif, jpg, jpeg, js, css, etc. In general, these files are downloaded automatically when a user requests a web page; the user does not request them explicitly.
2) Delete the log entries whose request methods are neither GET nor POST.
3) Delete the log entries whose status codes begin with 4 or 5. They are generally caused by a request error or a server error.

B. User Identification

User identification is to identify the web access logs of each unique user.
We can identify different users according to their IP addresses, but this is not very accurate, because many different users access websites through proxy servers and their IP addresses recorded in the web access log are the same. So we need more information to identify different users. The user-agent field in the web access log records the browser software and operating system of the web user, which is useful for user identification [12]. It is reasonable to assume that each different user-agent represents a different user. So if two requests come from the same IP address but have different user-agents, we treat them as coming from two different users. Only requests with the same IP address and the same user-agent are regarded as coming from the same user.

C. Session Identification

A user session is considered to be all of the page accesses that occur during a single visit to a website. Users are likely to visit a website many times over a long period of time, and the goal of session identification is to divide the page accesses of each user into individual sessions. We set a timeout to identify different sessions. If the time between two consecutive page requests exceeds a certain threshold, it is assumed that the user is starting a new session. In many cases, the time threshold is set to 600 seconds.

IV. SESSION FILTER

The first step of the session filter is to categorize all the user sessions into three types: human user sessions, crawler sessions and resource-download user sessions. We define categorizing rules according to the features of these three types of users in the web access log.

The user-agent of a crawler is different from that of other users. Table I shows the user-agents of several common crawlers. We can see that the user-agents of these crawlers all contain their names and key words such as spider, bot, slurp, crawler, etc. These key words can't be found in the user-agents of other users. So if the user-agent of a user contains one or more of these key words, it is assumed that this user is a crawler, and its sessions are categorized as crawler sessions.

TABLE I. THE USER-AGENTS OF SEVERAL COMMON CRAWLERS

Crawler Name      User-Agent
Google crawler    Mozilla/5.0 (compatible; Googlebot/2.1; +
Baidu crawler     Baiduspider+(+
Yahoo crawler     Mozilla/5.0 (compatible; Yahoo! Slurp;
Sogou crawler     Sogou web spider/4.0(+
Youdao crawler    Mozilla/5.0 (compatible; YoudaoBot/1.0; )
Soso crawler      Sosospider+(+

There are no such key words for resource-download users in the web access log, so we must identify them by their behavior characteristics, which are very obvious. They only request to download resources and never visit web pages. So if there are no web pages in the set of files requested by the user in a session, we regard the session as a resource-download user session. Web pages are the files with suffixes such as jsp, php, html, htm, asp, etc. Resource files usually have suffixes such as mp3, wmv, rmvb, rm, wma, rar, doc, etc.

After identifying the crawler sessions and the resource-download user sessions, we filter them out. The remaining user sessions are considered human user sessions. We use the human user sessions for transaction identification and sequential pattern mining.

TABLE II. THE STATISTICS OF THE COUNTS OF EACH TYPE OF USERS

The columns Log ID, Sum of requests, Request proportion of human users and Sum of sessions are part of the table, but their values are not legible in this copy and are omitted below.

          Request proportion                  Session proportion
Website   Crawlers   Resource-download users  Human users   Crawlers   Resource-download users
THZY      3.5%       45.7%                    83.0%         4.5%       12.5%
THZY      3.3%       50.9%                    82.0%         4.2%       13.8%
THZY      3.5%       41.1%                    82.0%         4.5%       13.5%
WLXT      0.3%       6.6%                     73.7%         1.8%       24.5%
WLXT      0.3%       7.3%                     75.4%         2.9%       21.7%
WLXT      0.6%       9.4%                     63.9%         1.9%       34.2%

We use our categorizing rules to categorize the user sessions of THZY and WLXT, which are two websites of Tsinghua University.
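The categorizing rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the session representation (a user-agent string plus a list of requested URLs) and the decision to treat a URL without any suffix (for example "/") as a web page are assumptions made here. The key words and page suffixes are the ones listed in the text.

```python
import posixpath
from urllib.parse import urlparse

# Key words and suffixes taken from the text; both lists end with "etc."
# in the paper, so a real deployment would probably extend them.
CRAWLER_KEYWORDS = ("spider", "bot", "slurp", "crawler")
PAGE_SUFFIXES = {"jsp", "php", "html", "htm", "asp"}


def url_suffix(url):
    """Return the lower-case file suffix of a request URL, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lstrip(".").lower()


def is_web_page(url):
    """Assumption: a URL with no suffix (e.g. "/") also counts as a web page."""
    suffix = url_suffix(url)
    return suffix == "" or suffix in PAGE_SUFFIXES


def categorize_session(user_agent, urls):
    """Classify one session as 'crawler', 'resource-download' or 'human'."""
    agent = user_agent.lower()
    # Rule 1: a crawler key word in the user-agent marks a crawler session.
    if any(keyword in agent for keyword in CRAWLER_KEYWORDS):
        return "crawler"
    # Rule 2: a session that requests no web page at all is a
    # resource-download user session.
    if not any(is_web_page(url) for url in urls):
        return "resource-download"
    # Everything else is treated as a human user session.
    return "human"
```

Note that plain substring matching on a short key word such as "bot" may occasionally misclassify an ordinary user-agent that happens to contain it; the sketch keeps the rule exactly as stated and does not try to handle that case.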
We choose three one-day periods of THZY's and WLXT's access logs respectively for analysis. The statistics of the counts of each type of user are shown in Table II. We can see that in both request and session counts, the proportion of human users is the largest, the proportion of resource-download users is the second largest, and the proportion of crawlers is the smallest. The proportion of resource-download users is not small. Their request proportion in THZY is even very close to that of human users. In contrast, the proportion of crawlers is very small: no more than 5%.

By analyzing the sessions of resource-download users, we find that their access behavior is very simple. Single-threaded download users request one or more resource files in a session, and each resource file is requested only once. Multi-threaded download users request the same resource file many times, and each request downloads a different part of the file. Usually, their referrers are not web pages of the website. So there is no access path for resource-download users.

We also analyze the access behavior of crawlers. In general, their sessions are very short. On average, each crawler session contains only 3 requests, and about 80% of crawler sessions contain only 1 or 2 requests. More than 90% of crawler requests have no referrer, so we can't construct their access paths. They also have few similar access sequences.

All of this shows that mining the web sequential patterns of resource-download users and crawlers is meaningless. For THZY and WLXT, crawler sessions and resource-download user sessions together account for about 20% to 30% of all sessions. Filtering them out improves the mining efficiency.

V. TRANSACTION IDENTIFICATION

After the session filter, only the human user sessions are left. But they are still too long. We find that some human user sessions in WLXT even contain more than 30 requests. Mining these sessions is a time-consuming process.

The aim of transaction identification is to divide a user session into sequences which are shorter and more meaningful. There are several ways to define a transaction. We use the MFP (Maximal Forward Path) presented by M. S. Chen et al. in [11]. An MFP is the path in a user session from the first page of a forward visit up to the page before a backward visit is made. A user session may contain one or more MFPs. Each MFP is a meaningful sequence of pages, and we define each MFP as a transaction. So transaction identification actually means finding all the MFPs in a user session.

We can find the MFPs from a user's access path in a session. In general, the structure of a user's access path looks like a tree. Users usually enter a website from the homepage. They then click different links on the page to visit several new pages. Each new page forms a new branch in the user's access path. They continue to visit pages in each branch, and may generate more branches. So we use a structure called the access path tree to describe a user's access path.

A user's access path tree is composed of nodes and directed edges. Nodes represent the web pages the user visits, and directed edges represent links between web pages. Every access path tree has a root node, representing the first page in a user session. Every non-leaf node has one or more child nodes, representing the branches in the access path. Figure 1 shows a typical access path tree. It represents the access sequence {A, B(A), C(A), D(A), E(B), F(E), G(C), H(G), I(C), J(D), K(J)} (the page inside the brackets is the referrer of the page outside the brackets).

A
  B
    E
      F
  C
    G
      H
    I
  D
    J
      K

Figure 1. A typical access path tree

Constructing the access path tree means finding the parent node of each node according to its referrer. Suppose the access sequence in a user session is $\{a_1, a_2, a_3, \ldots, a_n\}$. We traverse the pages in the access sequence. For page $a_i$ $(1 \le i \le n)$, we traverse the subsequence $\{a_1, a_2, \ldots, a_{i-1}\}$ from back to front.
If there exists a page $a_k$ $(1 \le k \le i-1)$ whose URL is the same as the referrer of $a_i$, we take $a_k$ as the parent node of $a_i$. Otherwise we ignore $a_i$ and deal with the next page. The pseudo-code of constructing an access path tree is shown in Table III.

We can find all the MFPs from a user's access path tree. Each MFP is the path from the root node to one of the leaf nodes. For example, we can find 4 MFPs in the access path tree shown in Figure 1. They are {A, B, E, F}, {A, C, G, H}, {A, C, I} and {A, D, J, K}. So the original access sequence can be divided into 4 transactions. Each transaction describes a meaningful activity of the web user, and each transaction is much shorter than the session.

TABLE III. THE PSEUDO-CODE OF CONSTRUCTING AN ACCESS PATH TREE

Input: User's access sequence L
Output: The root node of the access path tree

    for (i = 0; i < L.size(); i++)
        if (i == 0)
            root = L[i]
        else
            for (k = i - 1; k >= 0; k--)
                if (L[k]'s url == L[i]'s referrer)
                    L[i]'s parent node = L[k]
                    insert L[i] at the end of L[k]'s child node list
                    break
    return root

VI. SEQUENTIAL PATTERN MINING

After identifying all the transactions, we can construct the transaction database. The task of sequential pattern mining is to find the sequential patterns in the transaction database. These sequential patterns describe the characteristics of users' behavior very well.

A. Basic Definitions

We first introduce some basic definitions of sequential pattern mining. These basic definitions can be found in [17].

Definition 1: A sequence s, denoted by $\{s_1, s_2, \ldots, s_n\}$, is an ordered list of elements. $s_i$ $(1 \le i \le n)$ is an element of the sequence. Each element is an itemset, which has one or more items. In this paper, we only discuss the case in which each element has exactly one item. So $s_i$ also denotes an item in the sequence in this paper.
Definition 2: A sequence $\alpha = \{a_1, a_2, \ldots, a_n\}$ is called a subsequence of another sequence $\beta = \{b_1, b_2, \ldots, b_m\}$ if there exist integers $1 \le j_1 < j_2 < \cdots < j_n \le m$ such that $a_1 = b_{j_1}, a_2 = b_{j_2}, \ldots, a_n = b_{j_n}$. In this case, β is called a supersequence of α.

Definition 3: A sequence database S is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple <sid, s> is said to contain a sequence α if α is a subsequence of s. The support of a sequence α in a sequence database S is the number of tuples in the database containing α. It is denoted as $support_S(\alpha)$. The support of sequence α can also be expressed as the percentage of tuples in the database containing α.

Definition 4: Given a positive integer min_support as the support threshold, a sequence α is called a sequential pattern in sequence database S if $support_S(\alpha) \ge min\_support$. We also call α a frequent sequence in the sequence database. If α has only one item, we call it a frequent item. A web sequential pattern is the special case in which each item of a sequential pattern is a web page.

B. PrefixSpan Algorithm

Mining sequential patterns requires a mining algorithm. The AprioriAll algorithm and the GSP algorithm are both Apriori-like algorithms. They may generate a large number of candidate sequences during the mining process. Moreover, they need to scan the sequence database multiple times. So their mining efficiency is very low when the sequence database is very large or the sequential patterns are very long. In contrast, the FreeSpan algorithm and the PrefixSpan algorithm do not generate any candidate sequences. Furthermore, they use the database-projection technique to greatly reduce the scope of sequence scans. So their mining efficiency is much higher than that of the AprioriAll algorithm and the GSP algorithm. The PrefixSpan algorithm is itself an improvement on the FreeSpan algorithm. So in this paper, we base the sequential pattern mining on the PrefixSpan algorithm. To increase the mining efficiency, we make some improvements to it.

In this part, we give a brief introduction to the PrefixSpan algorithm. We first introduce the concepts of prefix, suffix and projected database.

Definition 1 (Prefix, Suffix): Given a sequence $\alpha = \{a_1, a_2, \ldots, a_n\}$, a sequence $\beta = \{b_1, b_2, \ldots, b_m\}$ $(m \le n)$ is called a prefix of α if and only if $b_i = a_i$ for all $i \le m$. We call the sequence $\gamma = \{a_{m+1}, a_{m+2}, \ldots, a_n\}$ the suffix of α with regard to prefix β.

Definition 2 (Projected database): Let α be a sequential pattern in a sequence database S. The α-projected database, denoted as $S_\alpha$, is the collection of suffixes of sequences in S with regard to prefix α.

The PrefixSpan algorithm is based on the ideas of database projection and sequential pattern-growth. It is described in Table IV.

TABLE IV. THE DESCRIPTION OF PREFIXSPAN ALGORITHM

Input: A sequence database S and the minimum support threshold min_support
Output: The complete set of sequential patterns
Method: Call PrefixSpan({}, 0, S)

Subroutine PrefixSpan(α, l, $S_\alpha$)

The parameters are: α is a sequential pattern; l is the length of α; $S_\alpha$ is the α-projected database if $\alpha \ne \{\}$, otherwise it is the sequence database S.

The procedure of PrefixSpan(α, l, $S_\alpha$) is as follows:

1) Scan $S_\alpha$ once and find each frequent item b which satisfies min_support.
2) For each frequent item b, append it to α to form a new sequential pattern α', and output α'.
3) For each α', construct the α'-projected database $S_{\alpha'}$ and call PrefixSpan(α', l+1, $S_{\alpha'}$), until no frequent item is found in the projected database.

C. Improvements on PrefixSpan Algorithm

Although the PrefixSpan algorithm is more efficient than the other algorithms, it can still be improved in some respects.

First, the PrefixSpan algorithm constructs a projected database for each sequential pattern. So if there are a large number of sequential patterns, the total size of the projected databases will be very large and will consume a lot of memory. To reduce the size of projected databases, J. Pei et al. presented a technique called pseudoprojection [7]. The idea of pseudoprojection is as follows: instead of performing physical projection by copying the whole suffix as a projected subsequence, one can record the identifier of the corresponding sequence and the starting position of the projected suffix in the sequence. This works because each projected subsequence is just a suffix of a sequence in the original database.

We can use a tuple <sequence_id, offset> to represent a projected subsequence. The sequence_id is the identifier of the corresponding sequence, and the offset is the starting position of the projection in the sequence (positions are generally counted from 0). For example, for the sequence s = {a, b, c, d, e, f, g, h}, if its sequence_id is 8, the tuple <8, 4> represents the projected subsequence {e, f, g, h}. We use these tuples to represent all the projected subsequences in a projected database, and so construct a pseudoprojected database. The pseudoprojected database clearly takes much less space than the corresponding physically projected one.
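As a rough illustration of how the procedure in Table IV combines with pseudoprojection, the following Python sketch mines single-item sequences while storing each projected database as a list of (sequence_id, offset) pairs. It is a minimal sketch under stated assumptions, not the authors' C++ implementation: the function name `prefixspan`, the use of list indices as sequence identifiers, and an absolute (count-based) `min_support` are all choices made here for illustration.

```python
from collections import defaultdict


def prefixspan(db, min_support):
    """Mine all sequential patterns from db (a list of lists of items).

    min_support is an absolute count. The result maps each pattern,
    written as a tuple of items, to its support.
    """
    patterns = {}

    def mine(prefix, projected):
        # projected is a pseudoprojected database: (sequence_id, offset)
        # pairs that point into db instead of holding copied suffixes.
        counts = defaultdict(int)
        for sid, offset in projected:
            seq = db[sid]
            seen = set()
            for pos in range(offset, len(seq)):
                seen.add(seq[pos])
            for item in seen:  # count each item once per sequence
                counts[item] += 1

        for item in sorted(counts):
            if counts[item] < min_support:
                continue
            pattern = prefix + (item,)
            patterns[pattern] = counts[item]
            # Project on the first occurrence of item at or after offset.
            next_projected = []
            for sid, offset in projected:
                try:
                    pos = db[sid].index(item, offset)
                except ValueError:
                    continue  # this sequence does not contain item
                next_projected.append((sid, pos + 1))
            mine(pattern, next_projected)

    mine((), [(sid, 0) for sid in range(len(db))])
    return patterns


if __name__ == "__main__":
    example = [list("abcdef"), list("abcdeg"), list("abcdfh")]
    for pattern, support in sorted(prefixspan(example, 3).items()):
        print(pattern, support)
```

Only the pairs are stored per pattern, so the memory held by a projected database grows with the number of sequences it covers rather than with the total length of their suffixes.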
Second, we find that the PrefixSpan algorithm may generate duplicate projections in the process of mining. For example, suppose {a, b, c, d, e, f}, {a, b, c, d, e, g} and {a, b, c, d, f, h} are three sequences in the sequence database. For the sequential pattern {a, b}, the {a, b}-projected database contains three sequences: {c, d, e, f}, {c, d, e, g} and {c, d, f, h}. Suppose min_support = 3; then c and d are both frequent items in the projected database. According to the PrefixSpan algorithm, we should append c and d to {a, b} to form the new sequential patterns {a, b, c} and {a, b, d}. But d appears after c in every projected subsequence, so every occurrence of d also appears in the {a, b, c}-projected database, and the {a, b, c, d}-projected database is the same as the {a, b, d}-projected database. In this case, the subsequent pattern-growth process of {a, b, c, d} is the same as that of {a, b, d}.

Is it necessary to mine both projected databases? Each sequential pattern with prefix {a, b, d} must be a subsequence of a sequential pattern with prefix {a, b, c, d}. So the sequential patterns with prefix {a, b, c, d} are more complete: they include the sequential patterns with prefix {a, b, d}, and there is no need to mine the {a, b, d}-projected database.

Duplicate projections occur quite often, especially in web sequential pattern mining. To avoid mining duplicate projected databases, we filter the frequent items found in the projected database. After finding all the frequent items, we scan the projected database once more to record the Forward Frequent ItemSet (FFIS for short) of each frequent item. The FFIS of a frequent item a, denoted as FFIS(a), is a set of tuples <x, counts>, where x is a frequent item which appears before a in some sequences, and counts is the number of sequences in which x appears before a.
So FFIS(a) can be expressed as follows:

$$FFIS(a) = \{\langle x, counts \rangle \mid x \text{ is a frequent item that appears before } a \text{ in some sequences, and } counts \text{ is the number of sequences in which } x \text{ appears before } a\}$$
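The definition of FFIS can be illustrated with a short Python sketch. It is not taken from the paper's implementation; the function name `forward_frequent_itemsets` and the dictionary layout are hypothetical. It also makes one interpretive assumption that the definition leaves open: "x appears before a" is read as "x appears before the first occurrence of a" in the sequence, which is the occurrence PrefixSpan projects on.

```python
def forward_frequent_itemsets(projected, frequent_items):
    """Compute FFIS(a) for every frequent item a.

    projected is a list of projected subsequences (lists of items) and
    frequent_items is the set of frequent items found in them. The
    result maps a to a dict {x: counts}, where counts is the number of
    sequences in which x occurs before the first occurrence of a
    (the "first occurrence" reading is an assumption of this sketch).
    """
    ffis = {a: {} for a in frequent_items}
    for seq in projected:
        seen = set()  # distinct frequent items met so far in this sequence
        for item in seq:
            if item not in frequent_items or item in seen:
                continue
            # First occurrence of item: every frequent item already seen
            # appears before it in this sequence.
            for x in seen:
                ffis[item][x] = ffis[item].get(x, 0) + 1
            seen.add(item)
    return ffis


if __name__ == "__main__":
    # The {a, b}-projected database from the example above, min_support = 3.
    projected = [list("cdef"), list("cdeg"), list("cdfh")]
    print(forward_frequent_itemsets(projected, {"c", "d"}))
```

On the {a, b}-projected database of the running example, FFIS(c) is empty and FFIS(d) holds the single tuple <c, 3>. The count 3 equals the support of d, which corresponds to the situation discussed above in which d always follows c.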

With the definition of FFIS, we can see that for a frequent item a, each tuple $\langle s, n \rangle \in FFIS(a)$ satisfies $n \le support(a)$. If there exists a tuple $\langle b, m \rangle \in FFIS(a)$ such that $m = support(a)$, then a appears after b in every sequence containing a. In this case we do not append a to the current sequential pattern to form a new sequential pattern, so there is no need to construct the corresponding projected database. By applying this filter to each frequent item, we avoid generating duplicate projections. In most cases, this improvement increases the efficiency of the PrefixSpan algorithm, although recording the FFIS of each frequent item takes some extra time.

We call the PrefixSpan algorithm with the improvements above the CIC-PrefixSpan algorithm. We use the CIC-PrefixSpan algorithm to do the sequential pattern mining.

D. Finding Maximal Sequential Patterns

The output of the CIC-PrefixSpan algorithm is the complete set of sequential patterns. But in general the number of these sequential patterns is too large, so it is necessary to select the useful and representative ones. Here we use the maximal sequential patterns to represent all the sequential patterns. A sequential pattern s is maximal if there exists no sequential pattern which is a proper supersequence of s. We know that all the subsequences of a sequential pattern are also frequent. So a maximal sequential pattern can represent all its sub-sequential patterns, and it is more useful for the analysis of user behavior. Furthermore, by filtering out the non-maximal sequential patterns, we reduce the size of the result set of sequential pattern mining.

The last step of sequential pattern mining is to find all the maximal sequential patterns. The method of doing so is shown in Table V.

TABLE V. THE METHOD OF FINDING MAXIMAL SEQUENTIAL PATTERNS

Suppose the complete set of sequential patterns is S, and the maximum length of the sequential patterns is n.

    for (k = n; k > 1; k--)
        for (each length-k sequential pattern s_k remaining in S)
            delete all the proper subsequences of s_k from S

The sequential patterns remaining in S are all the maximal sequential patterns.

VII. EXPERIMENTAL RESULTS

In this section, we validate the efficiency of the CIC-PrefixSpan algorithm through experiments. With the session filter, we have filtered out the non-human user sessions, leaving only the human user sessions for sequential pattern mining. We choose a one-day period of THZY's and WLXT's human user sessions respectively as the data sets of the experiments.

First, we identify the transactions in the human user sessions and construct the transaction database for each of the two websites. The numbers and the average lengths of sessions and transactions are shown in Table VI.

TABLE VI. THE NUMBERS AND THE AVERAGE LENGTHS OF SESSIONS AND TRANSACTIONS
Columns: Website (rows THZY and WLXT); the number of sessions; the average length of sessions; the number of transactions; the average length of transactions.

Then we study the performance of the CIC-PrefixSpan algorithm in comparison with the GSP algorithm and the PrefixSpan algorithm. We use these three algorithms to mine the sequential patterns in the above two transaction databases. The experimental environment is a machine with an Intel(R) Xeon(TM) 3.00GHz CPU and 2GB of main memory, running the Red Hat Enterprise Linux AS release 4 operating system. The three algorithms, GSP, PrefixSpan and CIC-PrefixSpan, are implemented in C++. All the programs are compiled with g++.

Figure 2. The runtime of the three algorithms in mining THZY's transaction database (min_support = 0.02%~1%; x-axis: support threshold in %; y-axis: runtime in seconds; series: GSP, PrefixSpan, CIC-PrefixSpan)

Figure 2 shows the runtime of the three algorithms at different support thresholds in mining THZY's transaction database. The support threshold ranges from 0.02% to 1%. We can see that the GSP algorithm is the slowest, and the CIC-PrefixSpan algorithm is slightly faster than the PrefixSpan algorithm. When min_support > 0.1%, the runtimes of the three algorithms are very close to each other. When min_support < 0.1%, the runtime of the GSP algorithm begins to increase rapidly, while the runtimes of the other two algorithms increase more steadily, and the gap between the PrefixSpan algorithm and the CIC-PrefixSpan algorithm remains very small. From Table VI we can see that the average length of THZY's transactions is very short (only 1.83). In this case, the efficiency of the CIC-PrefixSpan algorithm is very close to that of the PrefixSpan algorithm. So the benefit of the CIC-PrefixSpan algorithm is not obvious here, although it does improve the mining efficiency a little.

7 Runtime (in seconds) The runtime of the three algorithms in mining WLXT's transaction database (min_support = 0.1%~10%) GSP PrefixSpan CIC-PrefixSpan Support threshold (in %) Figure 3. The runtime of the three algorithms in mining WLXT s transaction database Figure 3 shows the runtime of the three algorithms at different support thresholds in mining WLXT s transaction database. The range of support threshold is from 0.1% to 10%. We can see that GSP algorithm is still the slowest. As min_support decreases, the runtime of GSP algorithm increases faster and faster. In contrast, the runtime of the other two algorithms increases much more steadily. CIC-PrefixSpan algorithm is more efficient than PrefixSpan algorithm. The gap between them is larger than that in mining THZY s transaction database. In most cases, the runtime of CIC- PrefixSpan algorithm is only about half of that of PrefixSpan algorithm. Table VI shows that the average length of WLXT s transactions is much longer than that of THZY s transactions. So when the sequences are longer, CIC-PrefixSpan algorithm can improve the mining efficiency obviously. Furthermore, in order to validate the accuracy of the mining result, we also register the numbers of maximal sequential patterns mined by the three algorithms at different support thresholds, as shown in Table VII and Table VIII. We can see that the number of maximal sequential patterns mined by CIC-PrefixSpan algorithm is very close to those of maximal sequential patterns mined by PrefixSpan algorithm and GSP algorithm at each support threshold. This indicates that CIC-PrefixSpan algorithm can find out almost all the maximal sequential patterns. TABLE VII. THE NUMBERS OF MAXIMAL SEQUENTIAL PATTERNS AT DIFFERENT SUPPORT THRESHOLDS IN MINING THZY S TRANSACTION DATABASE min_support GSP PrefixSpan CIC-PrefixSpan 1% % % % % % % % TABLE VIII. 
THE NUMBERS OF MAXIMAL SEQUENTIAL PATTERNS AT DIFFERENT SUPPORT THRESHOLDS IN MINING WLXT'S TRANSACTION DATABASE

min_support    GSP    PrefixSpan    CIC-PrefixSpan
[eight rows, starting at min_support = 10%; the values are missing from this copy]

In summary, our experiments show that the CIC-PrefixSpan algorithm has the best performance among the three algorithms tested. It significantly improves the mining efficiency when the sequences are longer and the support thresholds are lower, and it finds almost all of the maximal sequential patterns.

VIII. CONCLUSIONS

In this paper, we present an efficient method of web sequential pattern mining. This mining method categorizes the user sessions into human user sessions, crawler sessions and resource-download user sessions according to their features. Then it filters out the non-human user sessions, leaving only the human user sessions for mining. In order to find users' meaningful sequential patterns, this mining method divides the human user sessions into transactions, and mines the sequential patterns in the transaction database. We use the concept of MFP to define a transaction. We also present a method of identifying transactions based on the users' access path tree. In order to improve the mining efficiency, we make some improvements to the PrefixSpan algorithm. The improved algorithm, called CIC-PrefixSpan, works quite well in the experiments. This mining method can efficiently find the useful web sequential patterns.

With the development of the Internet, the composition of user sessions is becoming more and more complicated. Categorizing user sessions is very important for the analysis of user behavior, because it helps to improve the accuracy of the analysis. Besides the three types of users mentioned in this paper, there may exist other types of users that we have not considered. For non-human users, we can also analyze the characteristics of their behavior. These issues are left for future study.

REFERENCES

[1] R. Cooley, B. Mobasher, J. Srivastava.
Data Preparation for Mining World Wide Web Browsing Patterns[J]. Journal of Knowledge and Information System, 1999: 5~31.
[2] R. Cooley, B. Mobasher, J. Srivastava. Web Mining: Information and pattern discovery on the World Wide Web. In International Conference on Tools with Artificial Intelligence, 1997: 558~567.
[3] R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, Santiago, Chile, 1994: 487~499.
[4] R. Agrawal, R. Srikant. Mining Sequential Patterns[C]. Proc 1995 Int Conf Data Engineering (ICDE'95). Taipei: IEEE Computer Society, 1995: 3~14.

[5] R. Agrawal, R. Srikant. Mining Sequential Patterns: Generalizations and Performance Improvements[C]. Proc 5th Int Conf Extending Database Technology (EDBT). Avignon: Lecture Notes in Computer Science, 1996: 3~17.
[6] J. Han, J. Pei. FreeSpan: Frequent Pattern-projected Sequential Pattern Mining[C]. Proc 2000 Int Conf Knowledge Discovery and Data Mining (KDD'00). Boston: ACM Press, 2000: 355~359.
[7] J. Pei, J. Han, et al. Mining Sequential Patterns by Pattern-growth: The PrefixSpan Approach[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 6(10): 1~17.
[8] J. Pei, J. Han, B. Mortazavi-asl, H. Zhu. Mining access patterns efficiently from web logs. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000: 396~407.
[9] M. Arlitt. Characterizing Web User Sessions. SIGMETRICS Performance Evaluation Review, 2000, 28(2): 50~63.
[10] M. Dikaiakos, A. Stassopoulou, L. Papageorgiou. Characterizing Crawler Behavior from Web Server Access Logs[C]. Proceedings of the 4th International Conference on E-Commerce and Web Technologies (EC-Web'03). Springer, 2003: 369~378.
[11] M. S. Chen, J. S. Park, P. S. Yu. Data Mining for Path Traversal Patterns in a Web Environment[C]. In Proceedings of the 16th International Conference on Distributed Computing Systems, 1996: 385~392.
[12] J. Srivastava, R. Cooley, M. Deshpande, et al. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data[J]. SIGKDD Explorations, ACM SIGKDD, 2000, 1(2): 12~23.
[13] M. Arlitt, C. Williamson. Web server workload characterization: The search for invariants. In ACM SIGMETRICS Conference, 1996: 126~137.
[14] O. R. Zaiane, M. Xin, J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In: Advances in Digital Libraries Conference, Santa Barbara, 1998: 19~29.
[15] B. Mobasher, N. Jain, E. Han, J. Srivastava. Web Mining: Pattern discovery from World Wide Web transactions.
Technical Report, Department of Computer Science, University of Minnesota, Minneapolis.
[16] F. M. Facca, P. L. Lanzi. Mining interesting knowledge from weblogs: a survey. Data & Knowledge Engineering, 2005, 53(3): 225~241.
[17] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.

Jingjun Zhu was born in Jiangmen, Guangdong Province, China. He is a graduate student pursuing a Master's degree at Tsinghua University. His research interests include web mining and web performance evaluation.

Haiyan Wu was born in Daqing, Heilongjiang Province, China. She is a senior engineer at Tsinghua University and holds a PhD degree. Her research interests include network information security and educational informationization.

Guozhu Gao was born in Pingyao, Shanxi Province, China. He holds a Master's degree and works at Tsinghua University. His research interests include educational informationization and database management technology.

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Rajashree Shettar
Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in
Abstract