Log Information Mining Using Association Rules Technique: A Case Study Of Utusan Education Portal


Mohd Helmy Ab Wahab 1, Azizul Azhar Ramli 2, Nureize Arbaiy 3, Zurinah Suradi 4

1 Faculty of Electrical and Electronic Engineering, Kolej Universiti Teknologi Tun Hussein Onn (KUiTTHO), Parit Raja, Batu Pahat, Johor. Tel: ext. 1230, helmy@kuittho.edu.my
2,3,4 Faculty of Information Technology and Multimedia, Kolej Universiti Teknologi Tun Hussein Onn (KUiTTHO), Parit Raja, Batu Pahat, Johor. Tel: ext. 8056, azizulr@kuittho.edu.my, nureize@kuittho.edu.m, zurinah@kuitthoedu.my

Abstract

The advent of the World Wide Web has caused explosive growth in its size and usage. The Web itself provides a rich and surprising source for data mining. Hence, it has become necessary for Web site administrators to track and analyze the navigation patterns of Web site visitors. However, data mining techniques are not easily applied to Web data, because of problems related both to the technology underlying the Web and to the lack of standards in the design and implementation of Web pages. Information collected by Web servers is kept in the server log, which is the main source of data for analyzing user navigation patterns. Once logs have been pre-processed and sessions have been obtained, several kinds of access pattern mining can be performed, depending on the needs of the analyst. In this study, a data mining technique known as Association Rules was used to gain insight into website usage patterns. For this purpose, server logs from the Utusan Education portal were retrieved, pre-processed and analyzed.

Keywords: Web Mining, Web Usage Mining, Generalized Association Rules, Tutor.com, Server Logs

Introduction

Finding useful patterns in data is known by different names, including data mining. It is a process of discovering various models, summaries, and derived values from a large collection of data.
Data mining is a process of extracting information from large data sets and determining relationships using Artificial Intelligence (AI) or statistical techniques [7]. This information can be used to guide professionals in decision making and scientific research [2]. Data mining can also be applied in various fields, including computational linguistics, statistics, informatics, artificial intelligence and knowledge discovery. Moreover, it can be used to discover and analyze useful information from the Web [9].

One application of data mining techniques is Web mining. According to [8], Web Mining is the use of data mining techniques to automatically discover and extract information from Web documents. It consists of Web Content Mining, Web Structure Mining and Web Usage Mining [12] (see Fig. 1).

Figure 1 Taxonomy of Web Mining

Web Content Mining involves mining web data contents [10], ranging from HTML-based documents to XML-based documents accumulated on web servers. Its goal is mainly to assist or improve information finding and filtering. In contrast, Web Structure Mining tries to discover the model underlying the hyperlink topology and document structure of the Web. It aims to generate a structural summary of web sites and web pages, which in turn reveals the structure information of the Web. Web Usage Mining, on the other hand, focuses on the discovery of user access patterns from web server logs. Web servers generate log files that record web server activity. [3] suggested that web usage mining includes data related to the usage of the pages of a website, such as IP address, page references, and the date and time of access.

One of the data mining techniques commonly used in web mining is Association Rules. In brief, an association rule is an expression X => Y, where X and Y are sets of items. The meaning of such rules is quite intuitive: given a database D of transactions, where each transaction T ∈ D is a set of items, X => Y expresses that whenever a transaction T contains X, then T probably contains Y as well. The probability, or rule confidence, is defined as the percentage of transactions containing Y in addition to X with regard to the overall number of transactions containing X. Since its introduction in 1993 [1], the task of Association Rule mining has received a great deal of attention.

As we enter the age of Web technology, the amount of information available on the World Wide Web (WWW) has rapidly increased [11]. In general, a particular website may be accessed by more than a hundred thousand users. These server transactions are recorded in server log files. Log files produced by web servers are text files. Each file consists of HTTP transactions that are processed by hash coding URLs, IP addresses, client environment and cookies. Web logs provide information that helps organizations ensure adequate bandwidth and server capacity for their websites. Log file data can also offer valuable insight into how users use a web site. It reflects actual usage in natural working conditions, as opposed to the artificial setting of a usability lab. It also represents the activity of many users over a potentially long period of time, as opposed to a limited number of users observed for an hour or two each.

The general objective of this project is to determine user browsing and access patterns using Association Rules. Specifically, the objectives of this project are:
1. To determine and discover the user access patterns from the Utusan Education portal server.
2. To analyze the usage patterns and user behaviours of the Utusan Education portal that result from the Web usage mining process.

Method

Figure 2 Flow Chart of Log File Creation

Fig. 2 shows how a system administrator can gather information from the server logs. When a user sends a query to the server, the requested information is retrieved from the database. At the same time, the user session, including the URL, the client's IP address, the access date and time, and the query stem, is recorded in the server logs. These server logs are then pre-processed and mined in order to gain insight into the usage of the site as well as the users' behaviour. The methodology used in this study is adopted from [13].

Since log files are generated continuously, they contain large amounts of log data and require large storage devices. Analyzing and exploring regularities in web log records can enhance the quality and delivery of Internet information services to the end user, and can improve Web server system performance as well.

The application of Web mining in the education field is not new. Previous research has proposed going beyond usage mining to consider the content of the pages that have been visited. In a web-based learning environment, both the learner's browsing behaviour and the course content are important for deriving the learner's learning level, intentions, goals and interests. Incorporating course content can aid in understanding learners' browsing habits. In particular, understanding learners' browsing behaviour can facilitate the personalization of course content.

Log File

Figure 3 Association Rules

The log file was retrieved from the portal server to be analyzed. However, one of the main problems encountered when dealing with log files is the amount of data that needs to be pre-processed.
A sample of a single log file entry contains the following information:

:00: CSLNTSVR GET /tutor/include/style03.css HTTP/1.1 www.tutor.com.my Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98;+Win+9x+4.90) ASPSESSIONIDCSTSBQDC=NBKBCPIBBJHCMMFIKMLNNKFD;+browser=done;+aspsessionidaqrrcqcc=lbdgbpibdfcokHMLHEHNKFBN http://www.tutor.com.

For this particular session, 19 attributes were identified from the log file. For this study, attributes such as cookies, hostname and server IP are removed.

Pre-processing

Pre-processing is the task of converting the usage, content, and structure information from the data sources into the data abstractions necessary for pattern discovery. Before a data mining algorithm can be applied, the raw data must be pre-processed into the data abstraction required for further processing, such as Pattern Analysis. This process involves data extraction and data cleaning tasks.

Pattern Mining - Association Rule

In the Web domain, association rule generation can be used to relate the pages that are most often referenced together in a single server session. Association Rule mining techniques can be used to discover unordered correlations between items found in a database of transactions [5]. [6] pointed out that, in the context of Web usage mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. The support is the percentage of transactions that contain a given pattern.

Results

The Web Usage Mining task for the Utusan Education Portal is divided into two main stages. Each stage has its own phases with certain sub-activities or tasks. The first stage includes data retrieval and pre-processing. The second stage involves Pattern Discovery, where Association Rules are applied.

General Pattern Analysis Results (access pattern and users behaviors descriptive statistic)

The Utusan Education portal has several options (menu, content, dunia, info, etuisyen, banksoalan and interaktif) that can be accessed by the users.
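The options above correspond to the first segment of each request path, so the cleaning step and the grouping of requests by option can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the log is taken to be already parsed into dictionaries, the field names (`session_id`, `uri_stem`, `cookie`, `hostname`, `server_ip`) are hypothetical, and discarding style sheets and images is an assumed cleaning rule that the study does not spell out.

```python
from collections import Counter, defaultdict

# Hypothetical field names; the study removes cookies, hostname and server IP.
DROPPED_FIELDS = {"cookie", "hostname", "server_ip"}
# Assumed cleaning rule: requests for static files are not page views.
STATIC_SUFFIXES = (".css", ".gif", ".jpg", ".js")


def clean(record):
    """Return the record without the removed attributes, or None for static files."""
    if record["uri_stem"].lower().endswith(STATIC_SUFFIXES):
        return None
    return {k: v for k, v in record.items() if k not in DROPPED_FIELDS}


def option_of(uri_stem):
    """Map a URL stem such as '/content/notes.asp' to its option '/content'."""
    first_segment = uri_stem.strip("/").split("/")[0]
    return "/" + first_segment.split(".")[0]


def summarize(records):
    """Count requests per option and collect the set of options seen per session."""
    counts = Counter()
    sessions = defaultdict(set)
    for record in records:
        cleaned = clean(record)
        if cleaned is None:
            continue
        option = option_of(cleaned["uri_stem"])
        counts[option] += 1
        sessions[cleaned["session_id"]].add(option)
    return counts, dict(sessions)
```

Calling `summarize(records)` yields a request count per option, which is the kind of tally behind a "most requested directories" chart, together with one set of options per session, which is the transaction format that association rule mining expects.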
The portal options that users accessed were identified from the Uniform Resource Locator (URL) stem of each request.

Figure 4 10 Most Requested Directories in Tutor.com Portal

Fig. 4 shows that the /index path is the most requested page, followed by the /content path. The /index path is the top level of the Tutor.com portal, and it displays general information about the etuisyen, banksoalan, interaktif, bilikguru, komuniti and petalaman options. From the /index path and the other option paths, users can also select the other options provided by the Utusan Education portal.

Association Rules Results (support and confidence of the different options)

Association Rules are used to extract rules of the form X => Y (if X then Y), quantified with a confidence (the proportion of occurrences that verify Y among the occurrences that verify X) and a support (the proportion of occurrences that verify both X and Y among all occurrences). Fig. 5 shows the Association Rules, including support and confidence, obtained by applying the Apriori algorithm to identify the patterns, with a threshold of 15% for the minimum support and a threshold of 70% for the minimum confidence. Each rule is written as head <- body (support, confidence).

/banksoalan <- /index /content (17.4%, 73.6%)
/content <- /index /etuisyen (16.9%, 87.4%)
/info <- /dunia /content (16.4%, 74.7%)
/interaktif <- /banksoalan /index (17.8%, 85.0%)
/etuisyen <- /banksoalan /content (17.0%, 99.0%)
/info <- /contentrm /etuisyen1 (15.6%, 84.7%)
/dunia <- /index /content (15.9%, 71.8%)
/index <- /f_menu /dunia (19.6%, 75.3%)
/banksoalan <- /etuisyen /content (17.6%, 83.8%)
/interaktif <- /index /etuisyen1 (15.7%, 96.4%)

Figure 5 Output for Tutor.com Portal Options Association Rules (Related Pages)

Fig. 5 also shows the pages that are related to each other, that is, the options that were most frequently selected in sessions where certain other options were requested.
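To make the two measures concrete, the following Python sketch computes support and confidence over a list of sessions and lists the rules that meet both thresholds. It is a brute-force enumeration written for illustration, not the Apriori implementation used in the study, and the function names and the session representation (one set of option paths per session) are assumptions.

```python
from itertools import combinations


def support(itemset, sessions):
    """Fraction of sessions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for s in sessions if itemset <= s) / len(sessions)


def confidence(body, head, sessions):
    """Support of body and head together, divided by the support of body."""
    body_support = support(body, sessions)
    if body_support == 0:
        return 0.0
    return support(set(body) | {head}, sessions) / body_support


def mine_rules(sessions, min_support=0.15, min_confidence=0.70, body_size=2):
    """Enumerate rules head <- body that meet both thresholds (no Apriori pruning)."""
    items = sorted(set().union(*sessions))
    rules = []
    for body in combinations(items, body_size):
        for head in items:
            if head in body:
                continue
            sup = support(set(body) | {head}, sessions)
            if sup < min_support:
                continue
            conf = confidence(body, head, sessions)
            if conf >= min_confidence:
                rules.append((head, body, sup, conf))
    return rules
```

Under these definitions, confidence equals the rule's support divided by the support of its body. Read this way, the rule /etuisyen <- /banksoalan /content (17.0%, 99.0%) implies that roughly 0.170 / 0.990 ≈ 17.2% of sessions contained both /banksoalan and /content, and nearly all of those sessions also contained /etuisyen.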


Based on Fig. 5, it can be concluded that the rule with the highest support (19.6%) states that if, in a session, the user selected the /f_menu and /dunia option paths, the user also selected the /index option path. The rule with the highest confidence (99%) states that if, in a session, the user selected the /banksoalan and /content option paths, the user also selected the /etuisyen option path. Fig. 6 presents a graphical chart of the six most accepted rules for option relationships.

Figure 6 The 6 Most Accepted Rules for Tutor.com Education Portal (related options)

Fig. 7 below shows the Association Hyperedges for the Tutor.com Education Portal, which illustrate the portal pages that are orderly archived. A threshold of 10% for the minimum support and a threshold of 75% for the minimum confidence were used. The support of a hyperedge is the percentage of transactions that contain all items appearing in the hyperedge; the highest support, 13.7%, belongs to /index /content /f_menu, with a confidence of 87.5%. The confidence of a hyperedge is the average confidence of all rules that can be formed using the items in the hyperedge; the highest confidence, 97.5%, belongs to /banksoalan /interaktif /etuisyen /content, with a support of 10.2%.

/banksoalan /interaktif /index (10.8%, 90.4%)
/index /content /f_menu (13.7%, 87.5%)
/etuisyen /banksoalan /content (10.6%, 92.8%)
/interaktif /index /content (11.8%, 76.9%)
/dunia /info /content (10.4%, 85.0%)
/content /banksoalan /index (10.0%, 79.4%)
/etuisyen /etuisyen1 /index (11.8%, 87.0%)
/interaktif /banksoalan /etuisyen /f_menu (10.7%, 78.4%)
/banksoalan /etuisyen /content1 /content (10.5%, 90.7%)
/banksoalan /interaktif /etuisyen /content (10.2%, 97.5%)

Figure 7 The Output for Tutor.com Education Portal Options Association Hyperedges (orderly archived)

Furthermore, Fig. 8 shows the six most accepted Association Rules Hyperedges for the Tutor.com Education server log files during the selected period of time.

Figure 8 The 6 Most Accepted Rules for Tutor.com Education Options Association Hyperedges (orderly archived)

Discussion

At present, Web usage mining extracts patterns from Web server logs. The findings from this study provide an overview of the usage patterns of the Tutor.com Portal. The results are useful for the Web administrator in improving Web services and the performance of the Web site in terms of content, structure, presentation and delivery. With this information, web developers can design better Web pages that will attract more users to visit the website. The study also demonstrates the use of Association Rules in Web Usage Mining by extracting rules from the log files. The outcome of this study can be used by Tutor.com's system administrator as a guideline for enhancing the use of Tutor.com.

Conclusion

With the growth of Web-based applications, there is significant interest in analyzing Web usage data to better understand Web usage and to apply that knowledge to better serve users. This study uses Association Rules to identify rules for the Tutor.com Education Portal. The findings show that the etuisyen and banksoalan options have high access rates compared to the other options. These two options provide notes, exercises and trial examination questions that have been prepared by expert teachers. This educational material is prepared according to the latest examination format and needs. In future work, the Web usage mining process could be embedded in the web pages themselves, so that the system can automatically generate rules according to Association Rules. The system could then perform the Web usage mining phases itself, including data selection, data pre-processing, pattern discovery and analysis. In future, another method for analyzing sparse data could also be used in the study of E-Learning Web log access for knowledge extraction from Web log data.

References

[1] Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules. Proc. of the 20th VLDB Conference.
[2] Chen, M.-S., Han, J., and Yu, P. S. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6).
[3] Cooley, R., Mobasher, B., and Srivastava, J. Web Mining: Information and Pattern Discovery on the World Wide Web. Technical Report.
[4] Cooley, R., Tan, P. N., and Srivastava, J. WebSIFT: The Web Site Information Filter System. Proceedings of the Web Usage Analysis and User Profiling Workshop (WebKDD 99).
[5] Cooley, R., Mobasher, B., and Srivastava, J. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, Vol. 1, No. 1.
[6] Cooley, R. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, University of Minnesota.
[7] Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining to Knowledge Discovery: An Overview. Advances in Knowledge Discovery and Data Mining.
[8] Kosala, R. and Blockeel, H. Web Mining Research: A Survey. ACM SIGKDD, Vol. 2, Issue 1.
[9] Lee, R. S. T. and Liu, J. N. K. iJADE eMiner: A Web-Based Mining Agent Based on Intelligent Java Agent Development Environment on Internet Shopping. PAKDD, LNAI.
[10] Madria, S., Bhowmick, S. S., Ng, W. K., and Lim, E. P. Research Issues in Web Data Mining. Data Warehousing and Knowledge Discovery.
[11] Mohammadian, M. Intelligent Data Mining and Information Retrieval from World Wide Web for E-Business Applications.
[12] Pal, S. K., Talwar, V., and Mitra, P. Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions. IEEE.
[13] Xue, G. R., Zeng, H. J., Chen, Z., Ma, W. Y., and Lu, C. J. Log Mining to Improve the Performance of Site Search. Third Int. Conf. of WISE.
