Identification of Navigational Paths of Users Routed through Proxy Servers for Web Usage Mining


The web log file gives a detailed account of who accessed the web site, which pages were requested, in what order, and how long each page was viewed. However, log files are not only unstructured but in many cases also distorted. In particular, log files can be seriously distorted when web pages are requested by users routed through proxy servers. Preprocessing is therefore necessary before meaningful information can be analyzed and discovered. In this article, an algorithm is developed to identify the users and their navigational paths when users are routed through proxy servers. The proposed algorithm is then experimentally evaluated using a real website and ten groups of users, each with two or three people. The experimental results show that the average ratios of correct and incorrect page restoration are 78% and 4.1%, respectively, which indicates that the proposed algorithm can be used as a reasonable tool for identifying the navigational paths of users routed through proxy servers.

Keywords: Web Log File; Proxy Server; User Identification

Yong Soo Kim* and Bong-Jin Yum**

* Department of Industrial Engineering, Korea Advanced Institute of Science and Technology, Gusung-Dong, Yusung-Gu, Taejon, Korea. mmps@kaist.ac.kr
** Corresponding author. Department of Industrial Engineering, Korea Advanced Institute of Science and Technology, Gusung-Dong, Yusung-Gu, Taejon, Korea. bjyum@kaist.ac.kr

1 Introduction

The World Wide Web (WWW) continues to grow at an astounding rate in terms of traffic volume, size and complexity. Along with this growth, the complexity of tasks such as web site design and web server design has also increased. An important input to these design tasks is an analysis of how a web site is being used, and it is therefore necessary for web designers to analyze and discover relevant information from the WWW [1].

Log files (access, agent, error, and referrer log files) recorded by web servers contain information on all incoming requests, the user's browser type, operating system, server error type and the referrer header. Log files, however, are unstructured and in many cases distorted. In particular, log files can be seriously distorted when web pages are requested by users routed through proxy servers.

A proxy server acts as an intermediary between the users of a local network and the internet so that the local network can ensure security, administrative control and caching service. A proxy server is associated with, or constitutes a part of, the gateway server that separates the local network from outside networks, and it also serves as a firewall that protects the local network from outside intrusion. When a web page is requested by a user in the local network, the proxy server uses its own IP address to request the page, which is called IP masquerading. Therefore, the IP address of the proxy server, and not that of the user, is recorded in the log files [3]. Furthermore, a proxy server also functions as a cache server that looks into its local cache of previously downloaded web pages. If it finds the page, it returns the page to the user without forwarding the request to the internet. In this case, the request is not recorded in the log file [2].
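The caching effect just described can be illustrated with a small simulation. The sketch below is not part of the proposed algorithm; it assumes an idealized proxy whose cache never expires and in which every page is cacheable, and the function name, user labels and page names are hypothetical.

```python
def logged_requests(visits):
    """Return the pages that reach the web server's access log.

    `visits` is a chronological list of (user, page) pairs, all routed
    through one caching proxy. Assumption: the cache never expires, so
    only the first request for each page is forwarded to the server.
    """
    cached = set()
    log = []
    for _user, page in visits:
        if page not in cached:
            cached.add(page)   # the proxy stores the page
            log.append(page)   # the server sees this request only once
    return log


# Two hypothetical users behind the same proxy.
visits = [("user1", p) for p in "ABCD"] + [("user2", p) for p in "ABF"]
print("-".join(logged_requests(visits)))  # A-B-C-D-F
```

The user labels are ignored on purpose: because of IP masquerading the server cannot tell the two users apart, and because of caching it never sees the second user's requests for A and B.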

As mentioned above, log files can be distorted by IP masquerading and by the cache service of the proxy server. For example, assume that two users who are routed through a proxy server visit the same web site. If one visits pages A-B-C-D and the other visits pages A-B-F later, then pages A-B-C-D will be requested from the web server for the first user. However, only page F will be requested for the second user because pages A-B are already stored in the proxy server. Therefore, the log file will record requests for pages A-B-C-D-F instead of pages A-B-C-D-A-B-F. Such a distortion in a log file needs to be corrected before useful information can be analyzed and discovered from the WWW.

Cooley et al. [1] presented an algorithm for identifying users routed through a proxy server without cache service. In this paper, the work by Cooley et al. is extended to the case where cache service is also provided by the proxy server.

2 Algorithm

The proposed algorithm identifies the navigational paths of users routed through a proxy server. The algorithm constructs the browsing path for each user by analyzing the access log in conjunction with the referrer log, agent log, and site topology. The access log shows who accessed the web site and which pages were requested. The referrer log contains the referrer header of an incoming request, and the agent log shows which browser was used.

2.1 Overview of the Proposed Algorithm

The idea behind the proposed algorithm is as follows. Records with the same IP address and the same agent are assumed to belong to a single user. However, if the access log, referrer log and site topology indicate otherwise, the records are divided into paths of different users. In addition, a path completion procedure is provided for a newly generated user path. In the algorithm, the index page means the first page of the web site, and a parent page is the page immediately above the current page in the web site topology. A brief description of the algorithm is shown in Figure 1.

<Fig. 1>

2.2 Detailed Procedures

In Stage 1, records with the same IP address are sorted in chronological order. Procedures for Stages 2 and 3 are as follows.

Stage 2

Figure 2 shows a flowchart of the procedures in Stage 2 for each record with the same IP address. A tree represents the navigational path of a single user.

<Fig. 2>

First, if the requested page is the first record, a tree is constructed with that page alone (see step 9 in Figure 2). Otherwise, it is decided whether the requested page should be assigned to a tree or reserved. If there exists a tree which meets condition 1 or 2 below, the requested page is assigned to that tree.

Condition 1: The agent of the tree is identical to that of the requested page, and the tree contains a page which is identical to the referrer of the requested page (i.e., when the record is assigned to an existing tree, the site topology and the referrer of the record are taken into consideration (see Figure 3)).

<Fig. 3>

Condition 2: There is only one tree whose agent is identical to that of the requested page, and in the path from the most recently assigned page of the tree to the requested page in the site topology, the intermediate pages appear in other trees.

The reason for assigning the requested page to the tree satisfying condition 2 is as follows. Due to the cache service of a proxy server, previously requested pages do not appear in the access log. Therefore, if the intermediate pages (including the referrer of the currently requested page) appear in other trees, it is likely that the user reached the currently requested page through those intermediate pages. In Figure 4, the intermediate pages in the path from C to F in the site topology are B and D. Since B and D appear in Tree 2, F is assigned to Tree 1.

<Fig. 4>

In the case of condition 3 (i.e., there are multiple trees with the same agent as that of the requested page), the page is reserved for a later assignment based on additional decision criteria (see step 4 in Figure 2). If the page does not meet condition 1, 2 or 3, the page is regarded as that of a new user (see step 8 in Figure 2).

After the page is assigned to a tree which meets condition 1 or 2 (step 2 in Figure 2), it is checked whether two-user navigational patterns are found in the tree. A time sequence of assignments in which an unlikely backtracking exists suggests a two-user navigational pattern. If the tree is suspected to contain the navigational patterns of two users, it is split into two trees. Of the two branches under consideration, the one whose first page has the more recent request time is taken off the tree. The condition for splitting the tree into two trees is as follows.

Condition 4: The referrer of the requested page is not identical to any of the pages in the path, in the site topology, from the page assigned to the tree just prior to the requested page to the index page.

As shown in Figure 5, the time sequence of the pages in the tree suggests a two-user navigational pattern, since it is highly unlikely that a single user, after navigating in the order A-B-C-D-F, would backtrack to C to access E. Therefore, the tree is split into two separate trees. As for deciding which branch to take off the tree, D-F is chosen since C was requested before D from B. Thus, it is most likely that A-B-C-E represents the navigational pattern of one user, and D-F can be regarded as the navigational pattern of the other.

<Fig. 5>
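Condition 4 and the choice of the branch to detach can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `parent` map is an assumed encoding of the part of the site topology described for the Figure 5 example, the request times are simply the assignment order given there, and all function names are hypothetical.

```python
def path_to_index(page, parent):
    """Pages on the way from `page` up to the index page, inclusive."""
    path = [page]
    while page in parent:          # the index page has no parent
        page = parent[page]
        path.append(page)
    return path


def condition_4(referrer, previous_page, parent):
    """True if the referrer of the requested page lies outside the path
    from the previously assigned page to the index page."""
    return referrer not in path_to_index(previous_page, parent)


def branch_to_detach(branches, request_time):
    """Pick the branch whose first page was requested most recently."""
    return max(branches, key=lambda branch: request_time[branch[0]])


# Assumed topology for the Figure 5 example: A is the index page.
parent = {"B": "A", "C": "B", "D": "B", "E": "C", "F": "D"}
# Assumed request times: the order of assignment A, B, C, D, F, E.
request_time = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 5, "E": 6}

# E (referrer C) arrives right after F was assigned.
print(path_to_index("F", parent))                                  # ['F', 'D', 'B', 'A']
print(condition_4("C", "F", parent))                               # True
print(branch_to_detach([["C", "E"], ["D", "F"]], request_time))    # ['D', 'F']
```

In this reading, the split is triggered because C is not on the path F-D-B-A, and D-F is detached because D was requested after C.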

In the case of tree splitting (step 3 in Figure 2) or assignment reservation (step 4 in Figure 2), the page in question can be reassigned according to its order within the time sequence. If the request time of the first page in the split branch, or of the reserved page, is later than that of the lastly assigned page in an existing tree, and if there is only one such tree (steps 5 and 6 in Figure 2), then it is reassigned to that tree. For example, in Figure 6, the request time of D, which is the first page in the split branch, is later than that of the lastly assigned page in Tree 1, and therefore D-F is assigned to Tree 1. This reassignment rests on the assumption that records with the same agent belong to a single user.

<Fig. 6>

On the other hand, if the condition in step 5 in Figure 2 is satisfied but the condition in step 6 is not, the assignment is reserved again. If the condition in step 5 in Figure 2 is not satisfied, the page or the split branch is regarded as that of a new user (step 8 in Figure 2).

Finally, the page or the split branch regarded as that of a new user in step 8 in Figure 2 needs a path completion. Path completion is carried out by connecting the index page to the first page of the new user tree along the shortest path. This serves to restore pages which are missing from the access log file of the web server. For example, in Figure 7, the path is completed by connecting the index page to D, the first page of the new user tree, using the shortest path in the site topology (see Figure 4).

<Fig. 7>
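Path completion as described above amounts to a shortest-path search in the site topology. The sketch below uses a breadth-first search over a link map; the `links` dictionary is an assumed encoding of part of the topology used in the examples (with A as the index page), and the function names are hypothetical.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search for the shortest link path from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is not reachable from start


def complete_path(links, index_page, branch):
    """Prepend the pages between the index page and the branch's first page."""
    prefix = shortest_path(links, index_page, branch[0])
    if prefix is None:
        return list(branch)  # nothing can be restored
    return prefix[:-1] + list(branch)


# Assumed topology: A links to B, B to C and D, C to E, D to F.
links = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["F"]}
print(complete_path(links, "A", ["D", "F"]))  # ['A', 'B', 'D', 'F']
```

Breadth-first search is used here because every link counts as one step; any shortest-path method would serve equally well under that assumption.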

Stage 3

If reserved branches exist after Stage 2 is completed for all records, these branches are either assigned to an existing tree or regarded as new user trees, according to the site topology and their order within the time sequence. Detailed procedures for Stage 3 are as follows.

Stage 3: Assignment of a reserved branch (or a page) to another tree, or regarding it as a new user tree.

Step 1. If the last page of a tree that has the same agent as the reserved branch has an earlier request time than the first page of the reserved branch, the following is carried out:
i) If such a tree is unique, assign the reserved branch to the tree.
ii) If either the first page or the parent page of the reserved branch is identical to the lastly assigned page of a tree or to the parent page of that page, assign the reserved branch to that tree.
Step 2. If the reserved branch is not assigned in Step 1, it is regarded as a new user tree. In this case, the shortest path from the index page to the reserved branch constitutes a new user path.

As an illustration of Step 1, see Figure 8. Since the parent page of E, the first page in the reserved branch, is C in the site topology (see Figure 4), and C is the lastly assigned page, the reserved branch E-G is assigned to Tree 1.

<Fig. 8>
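One possible reading of Steps 1 and 2 is sketched below. The text leaves some room for interpretation, so this is an assumption-laden illustration: rule (ii) is applied only when rule (i) finds more than one candidate tree, the `Tree` class, the agent string and the request times are hypothetical, and the `parent` map encodes only the part of the topology needed for the Figure 8 example.

```python
from dataclasses import dataclass


@dataclass
class Tree:
    agent: str
    pages: list  # (page, request_time) pairs in order of assignment


def assign_reserved(branch, agent, trees, parent):
    """Return the index of the tree receiving the reserved branch,
    or None if the branch becomes a new user tree (Step 2)."""
    first_page, first_time = branch[0]
    # Step 1: same agent, and last page requested before the branch began.
    candidates = [i for i, t in enumerate(trees)
                  if t.agent == agent and t.pages[-1][1] < first_time]
    if len(candidates) == 1:            # rule (i): the tree is unique
        return candidates[0]
    for i in candidates:                # rule (ii): topological adjacency
        last_page = trees[i].pages[-1][0]
        branch_side = {first_page, parent.get(first_page)} - {None}
        tree_side = {last_page, parent.get(last_page)} - {None}
        if branch_side & tree_side:
            return i
    return None


# Figure 8 example with assumed times: two trees with the same agent.
parent = {"B": "A", "C": "B", "D": "B", "E": "C"}
trees = [Tree("agent-x", [("A", 1), ("B", 2), ("C", 3)]),
         Tree("agent-x", [("A", 1), ("B", 2), ("D", 4)])]
branch = [("E", 5), ("G", 6)]
print(assign_reserved(branch, "agent-x", trees, parent))  # 0, i.e. Tree 1
```

Tree 1 is chosen because C, the parent of E, is its lastly assigned page, which matches the outcome described for Figure 8.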

2.3 Example

Assume that a web site (see Figure 9) is visited by three users who are routed through a proxy server, and that the paths of the three users in chronological order are A-B-F-G, A-C-H and A-B-E-D, respectively. Suppose that the agent logs for A-B-F-G and A-C-H are identical while the agent log for A-B-E-D is a different one. In such a case, each of the repeatedly requested pages is recorded only once in the log file. Therefore, the recorded requested pages are A-B-F-C-E-G-D-H in this example.

<Fig. 9>

Based on the access log, referrer log, agent log and site topology, a hierarchical tree is constructed as shown in Figure 10 (by steps 9 and 2 in Figure 2). The tree A-B-F-C, as depicted in Figure 10, can be considered a one-user navigational pattern since the tree does not satisfy condition 4.

<Fig. 10>

Since the agent of the fifth record, E, is different from that of the previous tree (A-B-F-C), it is classified as "Others" in step 1 in Figure 2 and is regarded as that of a new user. Then, the path is completed by connecting E to the index page along the shortest path (by path completion in Figure 2) as follows.

<Fig. 11>

Since the agent of the sixth record, G, is identical to that of A-B-F-C and the referrer of G is F, the tree is expanded as shown in Figure 12 (by step 2 in Figure 2).

<Fig. 12>

However, A-B-F-C-G is not considered a single-user navigational path by the proposed algorithm since the tree satisfies condition 4. Therefore, step 3 in Figure 2 is carried out by separating C from A-B-F-C-G and completing the path from A to C according to the referrer of C as follows.

<Fig. 13>

The remaining two records, D and H, are assigned to trees according to their agents and referrers in step 2 in Figure 2. As shown in Figure 14, the proposed algorithm correctly identifies the three users and their navigational patterns.

<Fig. 14>

3 Performance Evaluation

Experiments were conducted to evaluate the algorithm proposed in Section 2. An experimental homepage was established on a web server and participants navigated the website through a proxy server. The performance of the proposed algorithm was then evaluated by analyzing the experimental results.

3.1 Experimental Environment

A real website of a company was used as the experimental website. The website is composed of fifty pages with the structure shown in Figure 15.

<Fig. 15>

IIS (Internet Information Server, Microsoft) and Midpoint [4] were used to host the web site and to run the proxy server, respectively. The constructed experimental environment is shown in Figure 16.

<Fig. 16>

The experiments were conducted using five groups of two users and another five groups of three users. Experiments were conducted twice for each group. That is, in the first experiment, each group was asked to navigate about ten pages (8 to 12 pages), and in the second experiment, about fifteen pages (13 to 17 pages). This results in ten samples from ten-page navigation tasks and another ten samples from fifteen-page navigation tasks.

In the experiments with two users, each computer had the same operating system and browser. That is, the users had identical agent logs. However, in the experiments with three users, two computers had the same operating system and browser but the third used a different browser to reflect the different market shares of the browsers. In 1999, Explorer and Netscape were used by 68.57% and 29.53% of the users, respectively, while others accounted for less than 1.9% [5], which indicates that about two thirds of the users use Explorer while one third use Netscape. Therefore, the experimental conditions were set up such that two users had identical agent logs while the third user had a different one.

3.2 Experimental Results

Correct and incorrect page restoration ratios were used to evaluate the proposed algorithm. The correct page restoration ratio is a relative measure of how well the navigated path was restored by the proposed algorithm, and the incorrect page restoration ratio is a relative measure of mistakenly assigned paths. For example, assume that, for an actual navigational path of eight pages ending in C-C1, the following path was restored using the proposed algorithm.

<Fig. 17>

Since five pages of the restored path (among them C-C1) were correctly restored, the correct page restoration ratio is

\[
\frac{\text{no. of correctly identified pages}}{\text{total no. of pages in the navigational path}} \times 100 = \frac{5}{8} \times 100 = 62.5\,(\%) \tag{1}
\]

Since the path D-D1 is mistakenly restored, the incorrect page restoration ratio is

\[
\frac{\text{total no. of pages not in the navigational path}}{\text{total no. of pages in the identified path}} \times 100 = \frac{2}{7} \times 100 = 28.6\,(\%) \tag{2}
\]
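The two ratios in Equations (1) and (2) can be computed directly from page counts. The short sketch below reproduces the worked example; the function and parameter names are hypothetical, and the counts are the ones stated in the equations.

```python
def restoration_ratios(n_correct, n_navigated, n_wrong, n_restored):
    """Return (correct ratio, incorrect ratio) in percent.

    n_correct   -- correctly identified pages
    n_navigated -- pages in the actual navigational path
    n_wrong     -- restored pages that are not in the navigational path
    n_restored  -- pages in the identified (restored) path
    """
    correct = 100.0 * n_correct / n_navigated
    incorrect = 100.0 * n_wrong / n_restored
    return correct, incorrect


correct, incorrect = restoration_ratios(5, 8, 2, 7)
print(f"{correct:.1f}% {incorrect:.1f}%")  # 62.5% 28.6%
```

Note that the two ratios use different denominators, so they do not have to add up to 100%.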

The experimental results are shown in Table 1.

<Table 1>

4 Conclusion

An algorithm is proposed for identifying the navigational paths of users who are routed through proxy servers, and it is evaluated by conducting experiments. The experimental results show a correct page restoration ratio of 78% and an incorrect page restoration ratio of 4.1% on average, which indicates that the proposed algorithm can be used as a reasonable tool for identifying the navigational paths of users routed through proxy servers. Future work will include further tests to validate the algorithm as well as its improvement using information from cookies.

References

[1] Cooley R, Mobasher B, and Srivastava J (1999) Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1),
[2] Pitkow J (1997) In search of reliable usage data on the WWW. Computer Networks and ISDN Systems, 20,
[3]
[4]
[5]

Figure Captions

Figure 1 Stages of the proposed algorithm
Figure 2 Flowchart for Stage 2
Figure 3 Example of step 2 in Figure 2
Figure 4 Example of step 2 in Figure 2
Figure 5 Example of step 3 in Figure 2
Figure 6 Example of step 7 in Figure 2
Figure 7 Example of Path completion
Figure 8 Example of Stage 3
Figure 9 Structure of the web site
Figure 10 Algorithm execution example (1)
Figure 11 Algorithm execution example (2)
Figure 12 Algorithm execution example (3)
Figure 13 Algorithm execution example (4)
Figure 14 Algorithm execution example (5)
Figure 15 Structure of the experimental website
Figure 16 Constructed experimental environment
Figure 17 Restoration example

Table Titles

Table 1 Experimental results

Stage 1: Sorting records with the same IP address in chronological order
Stage 2: Making a hierarchical tree based on access, agent, referrer log and site topology
- Construction of a tree
- Assignment of a page to a tree
- Splitting of the constructed tree if two-user navigational patterns are found
- Path completion
Stage 3: Assignment of a reserved branch to another tree or regarding it as a new user tree
- Assignment of a reserved branch which cannot be assigned in Stage 2
<Figure 1>

(From Stage 1) Is the requested page the first record?
- Yes: step 9. Construction of a tree with that page alone.
- No: step 1. Page assignment condition? (See Fig. 3 and 4)
Step 1 outcomes:
- Condition 1 or 2: step 2. Assignment of the page to a tree which meets condition 1 or 2. (See Fig. 3 and 4)
- Condition 3: step 4. Reservation of the assignment of the page.
- Others: step 8. Regarded as that of a new user.
Step 3. Are two-user navigational patterns found? (Condition 4 satisfied?) (See Fig. 5)
- Yes: Splitting of the constructed tree into two trees. (See Fig. 5)
Step 5. Is the request time of the reserved page or the first page of the split branch later than that of the lastly assigned page in each of the existing trees? (See Fig. 6)
- No: step 8.
Step 6. Is there only one tree which satisfies the above condition? (See Fig. 6)
- Yes: step 7. Reassignment of the reserved page or split branch to another tree. (See Fig. 6)
- No: Reservation of assigning the reserved page or the split branch to another tree.
Step 8. Regarded as that of a new user, followed by path completion. (See Fig. 7)
(To Stage 3)
<Figure 2>

<Figure 3> Trees 1 and 2 have the same agent. New record: E. The referrer of the new record: C. E is assigned to Tree 1.

<Figure 4> Site topology with pages A to G. The agents of Trees 1 and 2 are different. New record: F. The referrer of the new record: D. The agent of the new record, F, is identical to that of Tree 1. Most recently assigned page of Tree 1: C.

<Figure 5> Order of assignment: A (1), B (2), C (3), D (4), F (5), E (6). Index page: A. The requested page E was assigned to the tree in step 2 in Fig. 2. The referrer of the requested page: C. Page assigned to the tree just prior to the requested page: F. Since the referrer C of the current page E does not belong to the path A-B-D-F, the branch C-E or D-F is taken off the tree. Considering the request time of the first page of each branch: since the request time of D is later than that of C, D-F is taken off the tree.

<Figure 6> The split branch D (3), F (4) is reassigned to Tree 1.

<Figure 7> The new user tree D-F before and after path completion.

<Figure 8> The reserved branch E-G is assigned to Tree 1.

<Figure 9>

<Figure 10>

<Figure 11>

<Figure 12>

<Figure 13>

<Figure 14>

<Figure 15> Home page with section pages and their subpages (for example C1 to C5 and D1 to D4).

<Figure 16> User computers, proxy server (Midpoint), web server (IIS).

<Figure 17> Navigated path and restored path.

                                          Correct restoration ratio   Incorrect restoration ratio
Two users visiting about ten pages        86.8%                       3.8%
Two users visiting about fifteen pages    79.3%                       0.9%
Three users visiting about ten pages      72.5%                       4.1%
Three users visiting about fifteen pages  73.4%                       7.5%
<Table 1>
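As a quick arithmetic check, the averages quoted in the abstract and conclusion (78% and 4.1%) can be recovered from Table 1. The sketch below assumes that the reported averages are unweighted means over the four experimental conditions; the variable names are hypothetical.

```python
from statistics import mean

# Ratios (%) from Table 1, in row order.
correct = [86.8, 79.3, 72.5, 73.4]
incorrect = [3.8, 0.9, 4.1, 7.5]

# Unweighted means over the four conditions, rounded to one decimal.
print(round(mean(correct), 1), round(mean(incorrect), 1))
```

Under that assumption the means round to 78.0 and 4.1, in agreement with the figures reported in the text.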
