WEB USAGE MINING BY, GIRIJA PATIL PRIYANKA PATKAR AANUM SHAIKH ADITI THAKKAR



A SYNOPSIS ON
Web Usage Mining
BY
Girija Patil
Priyanka Patkar
Aanum Shaikh
Aditi Thakkar
Under the guidance of Internal Guide Prof. Sumitra Sadhukhan
Juhu-Versova Link Road, Versova, Andheri (W), Mumbai-53
University of Mumbai

Juhu-Versova Link Road, Versova, Andheri (W), Mumbai-53

This is to certify that
1. Girija Patil - B Priyanka Patkar - B Aanum Shaikh - B Aditi Thakkar B-759
have satisfactorily completed this synopsis entitled Web Usage Mining towards the partial fulfillment of the BACHELOR OF ENGINEERING IN (COMPUTER ENGINEERING) as laid down by University of Mumbai.

Guide: Prof. S. Sadhukhan
H.O.D.: Prof. S. B. Wankhade
Principal: Dr. Udhav Bhosle
Internal Examiner
External Examiner

ACKNOWLEDGEMENT

We wish to express our sincere gratitude to Dr. U. V. Bhosle, Principal, and Prof. S. B. Wankhade, H.O.D. of the Computer Department of RGIT, for providing us an opportunity to do our Seminar work on "Web Usage Mining". This Seminar bears the imprint of many people. We sincerely thank our Seminar guide Mrs. Sumitra Sadhukhan for her guidance and encouragement in the successful completion of our Seminar work. We would also like to thank our staff members for their help in carrying out this Seminar work. Finally, we would like to thank our colleagues and friends who helped us in completing the Seminar successfully.

1. Girija Patil
2. Priyanka Patkar
3. Aanum Shaikh
4. Aditi Thakkar

Abstract

Web Usage Mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web server data corresponds to the user logs that are collected at the Web server. Typical data collected at a Web server include IP addresses, page references, and access times of the users; this data is the main input to the present work. Our main aim is to concentrate on web usage mining and, in particular, to discover the web usage patterns of websites from the server log files. Web mining can give companies managerial insight into visitor profiles, which helps top management take strategic actions accordingly.

The proposed work is an efficient algorithm for generating frequent access patterns from the access paths of the users. The main aim of this algorithm is to reduce execution time and memory utilization as compared to the existing algorithm, viz. the Apriori algorithm. The frequent access patterns show the sequences of web pages which are frequently navigated by the user. The proposed algorithm, i.e. a combination of the Apriori and FP Growth algorithms, searches for large item-sets during its initial database pass and uses its result as the seed for discovering other large item-sets during subsequent passes. Thus, frequently accessed products can be discovered efficiently using the combination algorithm, which plays a vital role in Business Intelligence (BI).

Table of Contents

Chapter 1: Introduction
    1.1 Web Usage Mining Process
    Applications
Chapter 2: Review of Literature
    2.1 Apriori Algorithm
    2.2 FP-Growth Algorithm
Chapter 3: Existing System
    3.1 Input System
    Web Log Files
    Output System
Chapter 4: Proposed System
Chapter 5: Design Details
    5.1 Software Development Life Cycle (SDLC)
    5.2 Steps in SDLC
    5.3 Waterfall Model
    5.4 DFD
Chapter 6: Implementation Plan
Chapter 7: Analysis
    7.1 Details of Hardware and Software
    7.2 Backend
Chapter 8: Conclusion
References

List of Figures

- Web Usage Mining process
- Applications of Web Usage Mining
- Apriori algorithm flowchart
- Web log files
- Data extracted from web log files
- SDLC
- Gantt Chart for SDLC
- Waterfall Model
- User DFD
- User Usecase
- User Flowchart

CHAPTER 1
INTRODUCTION

The Web is a huge, explosive, diverse, dynamic and mostly unstructured data repository. It supplies an incredible amount of information, and it also raises the complexity of dealing with that information from the different perspectives of users, web service providers and business analysts. Web Usage Mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site.

Web usage mining itself can be classified further depending on the kind of usage data considered: web server data, application server data and application level data. Web server data corresponds to the user logs that are collected at the Web server. Web usage mining refers to the automatic discovery and analysis of patterns in click stream and associated data collected or generated as a result of user interactions with web resources on one or more web sites. It consists of three phases: data pre-processing, pattern discovery and pattern analysis. These are explained in depth in the following section.

1.1 Web Usage Mining Process

Fig: Web Usage Mining process

PRE-PROCESSING: Pre-processing includes the fusion and synchronization of data from multiple log files, data cleaning, page view identification, user identification, session identification (or sessionization), episode identification, and the integration of click stream data with other data sources such as content or semantic information.

PATTERN DISCOVERY: In the pattern discovery phase, frequent pattern discovery algorithms are applied to the pre-processed data. Web site designers should have a clear understanding of the users' profiles and the site objectives, as well as detailed knowledge of the way users will browse web pages.

PATTERN ANALYSIS: In the pattern analysis phase, interesting knowledge is extracted from frequent patterns and these results are used for website modification. Web usage pattern analysis is the process of identifying browsing patterns by analyzing the users' navigational behaviour. The web server log files, which store information about the visitors of the website, are used as input for the web usage pattern analysis process. These log files are first pre-processed and converted into the required formats so that web usage mining techniques can be applied to them.

APPLICATIONS

The figure shows Web Usage Mining applications, which can be implemented using various techniques like sequence mining, clustering, classification, etc. Our focus is to implement Web Usage Mining with the help of association rules, using algorithms like FP-growth, Apriori, improved FP-tree, etc.

Fig: Applications of Web Usage Mining

CHAPTER 2
REVIEW OF LITERATURE

Web Mining is the application of data mining techniques to automatically discover and extract information from the web. Web usage mining has various application areas such as web pre-fetching, site reorganization and web personalization. The most important task of web usage mining is discovering useful patterns from web log data by using pattern discovery techniques such as the Apriori and FP-Growth algorithms. The Apriori algorithm is a well known technique for weblog mining. Many algorithms already exist for generating frequent access patterns from the access paths, e.g. the Apriori algorithm, the FP-Tree algorithm, etc. However, these algorithms need more database scans for generating user access patterns, and therefore take more time and more memory. One improved approach adds the property of the user ID during every step of producing the candidate set and every step of scanning the database, to decide whether an item in the candidate set should be used to produce the next candidate set. That algorithm reduces the size of the candidate set in order to reduce the number of database scans.

2.1 Apriori Algorithm

It searches for large item-sets during its initial database pass and uses its result as the seed for discovering other large item-sets during subsequent passes. Item-sets having a support level above the minimum are called large or frequent item-sets, and those below are called small item-sets. The algorithm is based on the large item-set property, which states: any subset of a large (frequent) item-set must itself be large (frequent). Since the algorithm uses prior knowledge of frequent item-sets, it has been given the name Apriori. It is an

iterative level-wise search algorithm, where k-item-sets are used to explore (k+1)-item-sets.

The system operates in the following modules:
- Preprocessing module.
- Apriori or FP Growth Algorithm module.
- Association Rule Generation.
- Results.

The pre-processing module converts the log file, which normally is in ASCII format, into a database-like format which can be processed by the Apriori algorithm. Apriori implements level-wise search using the frequent item-set property and can be additionally optimized. Apriori is the simplest algorithm used for mining frequent patterns from a transactional database.

Advantages:
- Uses the large item-set property.
- Easily parallelized.
- Easy to implement.

Disadvantages:
- It is costly to handle a huge number of candidate sets.
- It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching, which is especially true for mining long patterns.

The Apriori algorithm is given below:

    Lk: set of frequent item-sets of size k (with min support)
    Ck: set of candidate item-sets of size k (potentially frequent item-sets)

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    return ∪k Lk;

Following is the flowchart for the Apriori algorithm:

Fig: Apriori algorithm flowchart
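The level-wise search in the pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function name `apriori`, the page names and the toy sessions are assumptions made for the example, and each user session is treated as one transaction (a set of visited pages).

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent item-set."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}

    frequent = dict(current)
    k = 1
    while current:
        # C(k+1): join pairs of frequent k-item-sets, then prune any
        # candidate that has an infrequent k-subset (large item-set property).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One database pass: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates meeting min_support.
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical sessions: each set holds the pages visited in one session.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
]
frequent = apriori(sessions, min_support=2)
```

Note that every iteration of the `while` loop makes a full pass over the transactions, which is the repeated-scan cost listed under the disadvantages.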

2.2 FP-Growth Algorithm

The FP-tree is a compact data structure that stores important, quantitative information about frequent patterns. The main components of the FP-tree are:
- It consists of one root labelled as "root", a set of item prefix sub-trees as the children of the root, and a frequent-item header table.
- Each node in the item prefix sub-tree consists of three fields: item-name, count, and node-link, where item-name registers which item this node represents, count registers the number of transactions represented by the portion of the path reaching this node, and node-link links to the next node in the FP-tree carrying the same item-name, or null if there is none.
- Each entry in the frequent-item header table consists of two fields, item-name and head of node-link, which points to the first node in the FP-tree carrying the item-name.

The FP-tree-based pattern-fragment growth mining method starts from a frequent length-1 pattern (as an initial suffix pattern), examines only its conditional pattern base (a sub-database which consists of the set of frequent items co-occurring with the suffix pattern), constructs its (conditional) FP-tree, and performs mining recursively with such a tree. The pattern growth is achieved via concatenation of the suffix pattern with the new ones generated from a conditional FP-tree. Since the frequent item-set in any transaction is always encoded in the corresponding path of the frequent-pattern trees, pattern growth ensures the completeness of the result. FP-growth is used for efficient mining of frequent patterns in large databases.

Algorithm of FP-Growth:

Input: A database DB, represented by the FP-tree constructed from it, and a minimum support threshold.
Output: The complete set of frequent patterns.
Method: call FP-growth(FP-tree, null).

    Procedure FP-growth(Tree, α) {
    1)  if Tree contains a single prefix path then {    // Mining single prefix-path FP-tree
    2)      let P be the single prefix-path part of Tree;
    3)      let Q be the multipath part with the top branching node replaced by a null root;
    4)      for each combination (denoted as β) of the nodes in the path P do
    5)          generate pattern β ∪ α with support = minimum support of nodes in β;
    6)      let freq_pattern_set(P) be the set of patterns so generated; }
    7)  else let Q be Tree;
    8)  for each item ai in Q do {                      // Mining multipath FP-tree
    9)      generate pattern β = ai ∪ α with support = ai.support;
    10)     construct β's conditional pattern-base and then β's conditional FP-tree Tree_β;
    11)     if Tree_β ≠ ∅ then
    12)         call FP-growth(Tree_β, β);
    13)     let freq_pattern_set(Q) be the set of patterns so generated; }
    14) return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
    }

When the FP-tree contains a single prefix path, the complete set of frequent patterns can be generated in three parts: the single prefix-path P, the multipath Q, and their combinations (lines 01 to 03 and 14). The resulting patterns for a single prefix path are the enumerations of its sub-paths that have the minimum support (lines 04 to 06). Thereafter, the multipath Q is defined (line 03 or 07) and the resulting patterns from it are processed (lines 08 to 13). Finally, in line 14 the combined results are returned as the frequent patterns found.

Advantages:
- Uses a divide and conquer strategy.
- Uses a compact data structure.
- Eliminates repeated database scans.
- It is generally faster than candidate-generation algorithms such as Apriori.
- The algorithm reduces the total number of candidate item-sets by producing a compressed version of the database in the form of an FP-tree.

Disadvantages:
- The FP-tree may not fit in memory.
- The FP-tree is expensive to build.
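To make the FP-tree fields and the recursive mining step more concrete, here is a minimal Python sketch. It is a simplified variant under stated assumptions: it implements only the multipath branch (lines 7 to 13 of the procedure) and omits the single prefix-path optimisation, and the names `FPNode`, `build_fp_tree` and `fp_growth`, as well as the toy sessions, are hypothetical.

```python
from collections import defaultdict


class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item-name
        self.count = 0          # transactions sharing the path to this node
        self.parent = parent
        self.children = {}
        self.link = None        # node-link: next node carrying the same item


def build_fp_tree(transactions, min_support):
    """transactions: list of (items, count) pairs. Returns (header, freq)."""
    freq = defaultdict(int)
    for items, n in transactions:
        for item in set(items):
            freq[item] += n
    freq = {i: c for i, c in freq.items() if c >= min_support}

    root = FPNode(None, None)
    header = {}                 # item-name -> head of its node-link chain
    for items, n in transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                child.link = header.get(item)   # new node becomes chain head
                header[item] = child
            child.count += n
            node = child
    return header, freq


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Return {pattern: support} for all frequent patterns ending in suffix."""
    header, freq = build_fp_tree(transactions, min_support)
    patterns = {}
    for item, support in freq.items():
        pattern = suffix | {item}
        patterns[pattern] = support
        # Conditional pattern base: the prefix path of every node for `item`.
        base = []
        node = header[item]
        while node is not None:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, node.count))
            node = node.link
        # Mine the conditional FP-tree recursively.
        patterns.update(fp_growth(base, min_support, pattern))
    return patterns


# Hypothetical sessions, each counted once.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
]
fp_patterns = fp_growth([(s, 1) for s in sessions], min_support=2)
```

In this sketch the original transactions are read twice to build the first tree (once to count items, once to insert paths); all later recursion works on conditional pattern bases rather than on the database.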

CHAPTER 3
EXISTING SYSTEM

The existing system uses the Apriori algorithm, which performs an iterative level-wise search. It is an algorithm for frequent item-set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item-sets as long as those item-sets appear sufficiently often in the database. Because each extension requires another pass over the database, this increases the execution time.

3.1 Input System

Web Log Files

A log file is a file in which every page request made to the web server is recorded. Each entry contains:
- IP address of the computer making the request.
- User ID (this field is not used in most cases).
- Date and time of the request.
- Size of the file transferred.
- Referring URL, that is, the URL of the page which contains the link that generated the request.
- Name and version of the browser being used.

Fig: Web log files
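As an illustration of how these fields can be extracted, the sketch below parses one log entry in Python. It assumes the log is written in the Apache/NCSA Combined Log Format, which the synopsis does not state; the sample line, the pattern name `LOG_PATTERN` and the function `parse_log_line` are made up for the example.

```python
import re
from datetime import datetime

# Assumed layout (Combined Log Format):
# ip ident user [time] "method url protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means no body was transferred.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical entry, not taken from the project's log files.
sample_line = (
    '192.0.2.10 - - [10/Oct/2014:13:55:36 +0530] '
    '"GET /products.html HTTP/1.1" 200 2326 '
    '"http://example.com/index.html" "Mozilla/5.0"'
)
record = parse_log_line(sample_line)
```

Lines that do not match the assumed layout return `None`, so they can be dropped during data cleaning.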


3.1.2 Output System

Web log files can be used to reconstruct the user navigation sessions within the site from which the log data originates. The output system mainly focuses on the generation of reports. These reports act as:
- Source of the information required (personalization).
- Permanent hard copy of the results.

Fig: Data extracted from web log files
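Reconstructing navigation sessions from parsed log records can be sketched as below. The sketch rests on two assumptions that the synopsis does not specify: users are identified by IP address alone, and a gap of more than 30 minutes between requests starts a new session (a common heuristic). The function name `sessionize` and the example records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group records (dicts with 'ip', 'time', 'url') into (ip, [urls]) sessions."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["ip"]].append(rec)

    sessions = []
    for ip, recs in by_user.items():
        recs.sort(key=lambda r: r["time"])
        current = [recs[0]["url"]]
        for prev, cur in zip(recs, recs[1:]):
            # A long pause between two requests closes the current session.
            if cur["time"] - prev["time"] > timeout:
                sessions.append((ip, current))
                current = []
            current.append(cur["url"])
        sessions.append((ip, current))
    return sessions


# Hypothetical records, as a log parser might produce them.
start = datetime(2014, 10, 10, 13, 0)
example_records = [
    {"ip": "192.0.2.10", "time": start, "url": "/index.html"},
    {"ip": "192.0.2.10", "time": start + timedelta(minutes=5), "url": "/products.html"},
    {"ip": "192.0.2.11", "time": start + timedelta(minutes=6), "url": "/index.html"},
    {"ip": "192.0.2.10", "time": start + timedelta(hours=2), "url": "/cart.html"},
]
example_sessions = sessionize(example_records)
```

Each resulting list of URLs is one access path, which can then serve as a transaction for the pattern discovery step.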

CHAPTER 4
PROPOSED SYSTEM

The major drawbacks of the existing system are high execution time and excess memory usage. The FP-growth algorithm takes more time for recursive calls and is good only when user access paths are common. It also consumes more memory. Thus we propose a combination of the FP Growth and Apriori algorithms to make the most of the advantages of both algorithms and to efficiently overcome the drawbacks of the existing system.

Modules:
1. Manage Users: In this module the admin manages the users, i.e. which user is regular and which is not.
2. Manage Web Log File: In this module the admin manages the usage of the users, i.e. which user visits which links and pages.
3. Data Preprocessing: In this module the system removes unwanted data like less visited links and pages.
4. Pattern Discovery (Apriori Algorithm): In this module the system applies the Apriori algorithm to the web log file.
5. Pattern Analysis: In this module the system predicts which domain of interest the user is interested in.
6. Result: This module provides the links which will satisfy the user's requirements (which will be very useful to the users).

Project Significance

Generally, this project will produce useful findings for analyzing the Web usage pattern for E-Learning:

This study will become the first step in analyzing the E-Learning portal by applying a Web usage mining approach with the basic association rule algorithms, Apriori and FP growth.
i. The outcomes from this study can be used by the Web administrator in order to plan necessary improvements, enhancements and valuable actions for the E-Learning portal.
ii. The implementation of the Web usage mining process for the E-Learning portal may become a guideline for system development purposes.

CHAPTER 5
DESIGN DETAILS

5.1 System Development Life Cycle:

The System Development Life Cycle is the process of developing information systems through investigation, analysis, design, implementation, and maintenance [7]. The System Development Life Cycle (SDLC) is also known as Information Systems Development or Application Development.

Fig: SDLC

Fig: Gantt Chart for SDLC

5.2 Steps involved in the System Development Life Cycle:

Below are the steps involved in the System Development Life Cycle. Each phase within the overall cycle may be made up of several steps.

Step 1: Software Concept
The first step is to identify a need for the new system. This includes determining whether a business problem or opportunity exists, conducting a feasibility study to determine whether the proposed solution is cost effective, and developing a project plan. This process may involve end users who come up with an idea for improving their work. Ideally, the process occurs in tandem with a review of the organization's strategic plan to ensure that IT is being used to help the organization achieve its strategic objectives. Management may need to approve concept ideas before any money is budgeted for their development.

Step 2: Requirements Analysis
Requirements analysis is the process of analyzing the information needs of the end users, the organizational environment, and any system presently being used, and of developing the functional requirements of a system that can meet the needs of the users. The requirements should be recorded in a document, user interface storyboard, executable prototype, or some other form. The requirements documentation should be referred to throughout the rest of the system development process to ensure the developing project aligns with user needs and requirements. Professionals must involve end users in this process to ensure that the new system will function adequately and meet their needs and expectations.

Step 3: Architectural Design

After the requirements have been determined, the necessary specifications for the hardware, software, people, and data resources, and the information products that will satisfy the functional requirements of the proposed system, can be determined. The design serves as a blueprint for the system and helps detect problems before they are built into the final system. Professionals create the system design, but must review their work with the users to ensure the design meets users' needs.

Step 4: Coding and Debugging
Coding and debugging is the act of creating the final system. This step is done by the software developer.

Step 5: System Testing
The system must be tested to evaluate its actual functionality in relation to expected or intended functionality. Other issues to consider during this stage are converting old data into the new system and training employees to use the new system. End users will be key in determining whether the developed system meets the intended requirements, and the extent to which the system is actually used.

Step 6: Maintenance
Inevitably the system will need maintenance. Software will undergo change once it is delivered to the customer, and there are many reasons for the change. Change could happen because of some unexpected input values into the system. In addition, changes in the system could directly affect the software operations. The software should be developed to accommodate changes that could happen during the post-implementation period.

There are various software process models, such as:
- Prototyping Model
- RAD Model

- The Spiral Model
- The Waterfall Model
- The Iterative Model

5.3 Waterfall model

The software process model is the model which we are going to use for the development of the project. There are many software process models available, and the choice should be made according to the project size, that is, whether it is an industry-scale, large-scale or medium-scale project. The model chosen should be suitable for the project, because the cost of the project changes with the software process model, as the steps in each model vary. This software is built using the waterfall model. This model suggests work cascading from step to step like a series of waterfalls. It consists of the following steps, in the following order.

Fig: Waterfall model

Analysis Phase: To attack a problem by breaking it into sub-problems. The objective of analysis is to determine exactly what must be done to solve the problem. Typically, the system's logical elements (its boundaries, processes, and data) are defined during analysis.

Design Phase: The objective of design is to determine how the problem will be solved. During design the analyst's focus shifts from the logical to the physical. Data elements are grouped to form physical data structures, screens, reports, files, and databases.

Coding Phase: The system is created during this phase. Programs are coded, debugged, documented, and tested. New hardware is selected and ordered. Procedures are written and tested. End-user documentation is prepared. Databases and files are initialized. Users are trained.

Testing Phase: Once the system is developed, it is tested to ensure that it does what it was designed to do. After the system passes its final test and any remaining problems are corrected, the system is implemented and released to the user.

All these phases are described with respect to the project in the rest of the document.

5.4 Data Flow Diagram

A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modelling its process aspects. A DFD is often used as a preliminary step to create an overview of the system, which can later be elaborated. DFDs can also be used for the visualization of data processing (structured design). A DFD shows what kind of information will be input to and output from the system, where the data will come from and go to, and where the data will be stored. It does not show information about the timing of processes, or about whether processes will operate in sequence or in parallel (which is shown on a flowchart).

Fig: User DFD

Fig: User Use Case

Fig: User Flowchart

CHAPTER 6
IMPLEMENTATION PLAN

Phase 1:

Activity  Description                           Effort in person weeks  Deliverable
P1-01     Requirement Analysis                  2 weeks                 Requirement Gathering
P1-02     Existing System Study & Literature    3 weeks                 Existing System Study & Literature
P1-03     Technology Selection                  2 weeks                 .NET
P1-04     Modular Specifications                2 weeks                 Module Description
P1-05     Design & Modeling                     4 weeks                 Analysis Report
          Total                                 13 weeks

Phase 2:

Activity  Description                           Effort in person weeks  Deliverable
P2-01     Detailed Design                       2 weeks                 LLD / DLD Document
P2-02     UI and user interactions design       Included in above       UI document
P2-03     Coding & Implementation               12 weeks                Code Release
P2-04     Testing & Bug fixing                  2 weeks                 Test Report
P2-05     Performance Evaluation                4 weeks                 Analysis Report
P2-06     Release                               Included in above       System Release
          Total                                 20 weeks

Deployment efforts are extra.

Gantt Charts

The Gantt chart shows planned and actual progress for a number of tasks displayed against a horizontal time scale. It is an effective and easy-to-read method of indicating the actual current status of each task in a set compared to the planned progress for that task. Gantt charts provide a clear picture of the current state of the project.

Planned Gantt Chart

Table: Planned Gantt Chart

CHAPTER 7
ANALYSIS

FEASIBILITY STUDY

The very first phase in any system development life cycle is preliminary investigation. The feasibility study is a major part of this phase. It is a measure of how beneficial or practical the development of an information system would be to the organization. The feasibility of the software under development can be studied in terms of the following aspects:
- Operational feasibility.
- Technical feasibility.
- Economical feasibility.

OPERATIONAL FEASIBILITY

The application will reduce the time consumed in maintaining manual records, and maintaining the records will no longer be tiresome and cumbersome. Hence operational feasibility is assured.

TECHNICAL FEASIBILITY

Minimum hardware requirements:
- 1.66 GHz Pentium processor or Intel compatible processor.
- 1 GB RAM.
- Internet connectivity.
- 80 MB hard disk space.

ECONOMICAL FEASIBILITY

Once the hardware and software requirements are fulfilled, there is no need for the user of our system to spend on any additional overhead. For the user, the application will be economically feasible in the following aspects:
- The application will reduce a lot of labour work. Hence the effort will be reduced.

- Our application will reduce the time that is wasted in manual processes.
- The storage and handling problems of the registers will be solved.

7.1 DETAILS OF HARDWARE AND SOFTWARE

.NET Framework

The .NET Framework is an environment for building, deploying, and running XML Web services and other applications. It is the infrastructure for the overall .NET platform. The .NET Framework consists of three main parts: the common language runtime, the class libraries, and ASP.NET.

Why C#?

C# is the new language with the power of C++ and the slickness of Visual Basic. It cleans up many of the syntactic peculiarities of C++ without diluting much of its flavour (thereby enabling C++ developers to transition to it with little difficulty), and its superiority over VB6 in facilitating powerful OO implementations is without question.

7.2 BACK-END: Microsoft SQL Server

Business today demands a different kind of data management solution. Performance, scalability, and reliability are essential, but businesses now expect more from their key IT investment. SQL Server 2005 exceeds dependability requirements and provides innovative capabilities that increase employee effectiveness, integrate heterogeneous IT ecosystems, and maximize capital and operating budgets. SQL Server 2005 provides the enterprise data management platform your organization needs to adapt quickly in a fast-changing environment. With the lowest implementation and maintenance cost in the industry, SQL Server 2005 delivers rapid return on your

data management investment. SQL Server 2005 supports the rapid development of enterprise-class business applications that can give your company a critical competitive advantage. Benchmarked for scalability, speed, and performance, SQL Server 2005 is a fully enterprise-class database product, providing core support for Extensible Markup Language (XML) and Internet queries.

CHAPTER 8
CONCLUSION

Thus the proposed work is an efficient algorithm for generating frequent access patterns from the access paths of the users. This algorithm is optimized to take less time compared to the existing algorithms and to store the access paths in a compressed format. The main aim of this algorithm is to reduce execution time and memory utilization as compared to the existing algorithms, viz. the Apriori algorithm. The frequent access patterns show the sequences of web pages which are frequently navigated by the user. The proposed algorithm does not generate any candidate sets; however, more patterns are generated, and because of this the number of tree traversals is higher.

Information content on the WWW is increasing at an exponential rate, and it is not surprising to find users having difficulty in navigating and finding relevant information. Likewise, e-commerce site developers find it difficult to observe potential customers or web site structure. We are thus making an attempt to improve the existing algorithms and bring web mining to a new level.

References

[1] B. Santhosh Kumar, K. V. Rukmani, Implementation of Web Usage Mining Using Apriori and FP Growth Algorithm, Int. J. of Advanced Networking and Applications, Volume: 01, Issue: 06, (2010).
[2] Mishra Rahul, Choubey Abha, Discovery of Frequent Patterns from Web Log Data by using FP-Growth Algorithm for Web Usage Mining, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 2, pp., 2012.
[3] Han J., Pei J., Yin Y. and Mao R., Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery.
[4] Baglioni M., Ferrara U., Romei A., Ruggieri S., and Turini F., (2003). Preprocessing and Mining Web Log Data for Web Personalization. In Proceedings of the 8th Italian Conference on Artificial Intelligence, LNCS Vol. 2829, pp.


mctrgit International Conference on Advances in Computing and Information Technology ICACIT 2014 WEB USAGE MINING USING APRIORI AND FP GROWTH ALGORITHM Girija Patil 1, Priyanka Patkar 2, Aanum Shaikh 3,