TABLE OF CONTENTS

ABSTRACT iii
LIST OF TABLES xii
LIST OF FIGURES xiii
LIST OF ABBREVIATIONS xiv

1. INTRODUCTION 1
  1.1 WEB MINING 2
    1.1.1 Association Rules 2
    1.1.2 Association Rule Mining 3
    1.1.3 Clustering 3
    1.1.4 Classification 4
  1.2 WEB MINING CATEGORIES 5
    1.2.1 Web Content Mining 5
    1.2.2 Web Structure Mining 6
    1.2.3 Web Usage Mining 6
  1.3 INFORMATION RETRIEVAL ON THE WEB 7
    1.3.1 Web Search Environments 7
      1.3.1.1 Ontology Web 8
      1.3.1.2 Semantic Web 8
    1.3.2 Information Searching on the Web 9
    1.3.3 Web Information Retrieval Approaches 10
  1.4 INFORMATION FILTERING SYSTEM 11
    1.4.1 Content-Based System 12
    1.4.2 Collaborative Filtering 12
    1.4.3 Content-Boosted Collaborative Filtering 12
    1.4.4 Combining Content-Based and Collaborative Filters 13
  1.5 PAGE RANKING IN WEB SEARCH 13
    1.5.1 Key Documents Ranking 14
    1.5.2 Related Documents Ranking 14
  1.6 WEB PERSONALIZATION 15
    1.6.1 Recommendation System 15
    1.6.2 Personalized Recommendation System 17
  1.7 SEARCH ENGINES 17
    1.7.1 Ontology Mining Search Engine 18
    1.7.2 Crawler-Based Search Engines 19
    1.7.3 Human-Powered Directories 19
    1.7.4 Hybrid Search Engines 20
  1.8 PROPOSED WORK 20
    1.8.1 Problem Definition 20
    1.8.2 Research Focus 20
  1.9 THESIS CONTRIBUTIONS 21
    1.9.1 Contribution in Web Page Classification 21
    1.9.2 Contribution in Retrieval of Relevant Web Pages 21
    1.9.3 Contribution to Preparation of User Profile from Web Log File 22
    1.9.4 Contribution to Personalizing the Web 22
  1.10 THESIS ORGANIZATION 23
2. LITERATURE SURVEY 24
  2.1 WEB PAGE RETRIEVAL PROCESS 24
  2.2 KNOWLEDGE ACQUISITION FOR WEB PERSONALIZATION 24
  2.3 USER PROFILE ANALYSIS 26
    2.3.1 Works on User Profile Analysis 26
    2.3.2 User Behaviour Analysis 27
    2.3.3 Cluster Analysis 28
  2.4 WEB PAGE ANALYSIS 29
    2.4.1 Classification of Web Pages 29
    2.4.2 Works on Classification 31
    2.4.3 Fuzzy Classification 33
  2.5 ASSOCIATION RULE MINING 34
    2.5.1 Works on Association Rule Mining 36
    2.5.2 Fuzzy Association Rule Mining 37
  2.6 RELEVANT INFORMATION RETRIEVAL 38
    2.6.1 Works on Relevant Information 38
    2.6.2 Web Page Ranking Algorithms 41
      2.6.2.1 Hyper Search Algorithm 42
      2.6.2.2 Hyperlink-Induced Topic Search (HITS) 42
      2.6.2.3 PageRank 42
      2.6.2.4 Trust Rank 43
  2.7 WORKS ON PAGERANK ALGORITHMS 43
    2.7.1 Web Page Filtering Process 44
    2.7.2 Content-Based System 45
    2.7.3 Collaborative Filtering System 45
    2.7.4 Hybrid Filtering 46
    2.7.5 Works on Filtering Process 46
  2.8 INTELLIGENT PERSONALIZED RECOMMENDATION 47
    2.8.1 Personalized Web Search 47
    2.8.2 Works on Personalization 48
    2.8.3 Works on Recommendation 49
  2.9 PROPOSED WORK 51

3. SYSTEM ARCHITECTURE 53
  3.1 USER INTERFACE 54
  3.2 SEARCH ENGINE INTERFACES 54
  3.3 WEB PAGES 54
  3.4 FUZZY ASSOCIATION RULE GENERATOR 55
  3.5 CLASSIFIED WEB PAGES 55
  3.6 KNOWLEDGE ACQUISITION SYSTEM 56
  3.7 DOMAIN EXPERT INTERFACE 56
  3.8 RULE MANAGER 57
  3.9 RULE BASE 57
  3.10 USER PROFILE 57
  3.11 USER PROFILES ANALYSIS MODULE 57
    3.11.1 Feature Selection 58
    3.11.2 Classification 58
    3.11.3 Clustering 58
  3.12 RELEVANT INFORMATION EXTRACTION MODULE 59
    3.12.1 Filtering 59
    3.12.2 Page Ranking 59
  3.13 RELEVANT WEB PAGES 60
  3.14 WEB PERSONALIZATION AND RECOMMENDATION MODULE 60
    3.14.1 Fuzzy Temporal Association Rule Mining 60
  3.15 THESIS CONTRIBUTION 61

4. USER PROFILE ANALYSIS 63
  4.1 DATA PREPROCESSING 64
    4.1.1 Data Set 64
    4.1.2 Data Discretization for Preprocessing 65
    4.1.3 Classification on Anova-T Data Selection 65
      4.1.3.1 Algorithm Steps 67
      4.1.3.2 Pseudo Code for Anova-T Classifier 67
    4.1.4 Fuzzy-D Discretization 69
      4.1.4.1 Algorithm Steps 69
  4.2 USER PROFILE CLUSTERING 70
    4.2.1 Results and Discussion 73
  4.3 WEBPAGE ANALYSIS SUBSYSTEM 74
    4.3.1 Algorithm 75
    4.3.2 Proposed Algorithm for Fuzzy Association Rule Mining 76
    4.3.3 Results and Discussion 79
  4.4 RELEVANT INFORMATION EXTRACTION 80
    4.4.1 Rule Schema 80
    4.4.2 Proposed Rule Discovery Algorithm 81
    4.4.3 Filtering 82
    4.4.4 Proposed Algorithm 84
    4.4.5 Page Ranking Module 85
    4.4.6 Proposed Algorithm 86
    4.4.7 Results and Discussion 87

5. WEB PERSONALIZATION AND RECOMMENDATION 89
  5.1 FUZZY TEMPORAL ASSOCIATION RULE MINING 90
    5.1.1 Proposed Algorithm 91
    5.1.2 Proposed Fuzzy Temporal Association Rule Mining Algorithm 92
    5.1.3 Pseudo Code for FTA Rule Mining 93
    5.1.4 Results and Discussion 94

6. CONCLUSIONS AND FUTURE ENHANCEMENTS 96
  6.1 CONCLUSIONS 96
    6.1.1 Web Page Classification 96
    6.1.2 Retrieval of Relevant Web Pages 97
    6.1.3 User Profile Preparation and its Analysis 97
    6.1.4 Personalizing the Web 98
  6.2 FUTURE ENHANCEMENTS 99

REFERENCES 100
LIST OF PUBLICATIONS 113
VITAE 114

LIST OF TABLES

TABLE NO.   TITLE                                                           PAGE NO.
4.1         Anova-T Residue Classifications on User Data                    68
4.2         Fuzzy-D Discretization - Reduced Classification Error Report    70
4.3         User's Profile Analysis                                         71
4.4         Cluster Analysis of User Profiles                               73
4.5         Ontology Based Collaborative Filter Analysis                    85

LIST OF FIGURES

FIGURE NO.  TITLE                                                                          PAGE NO.
3.1         System Architecture                                                            53
4.1         Architecture for User Profile Analysis                                         63
4.2         Anova-T Classification Method                                                  66
4.3         Cluster Structure                                                              72
4.4         Performance of Cluster Analysis                                                73
4.5         Web Page Analysis Using Fuzzy Association Rule Mining                          74
4.6         Comparison of Classification Accuracies using Association Rules                75
4.7         Classification Accuracy of Proposed Fuzzy Association Rule Mining Algorithm    79
4.8         Precision and Recall Analysis Graph                                            80
4.9         System Architecture for Relevant Web Page Retrieval                            81
4.10        System Overview                                                                82
4.11        Architecture of Ontology Based Collaborative Filter                            83
4.12        Relationship between Precision and Recall                                      87
4.13        Web Document Retrieval Analysis with respect to Time                           88
5.1         System Architecture of Web Personalization and Recommendation Module           89
5.2         Performance Analysis of Proposed Recommendation System                         94
5.3         Relevancy Measurement                                                          95

LIST OF ABBREVIATIONS

ANOVA - Analysis of Variance
FTARM - Fuzzy Temporal Association Rule Mining
HITS - Hyperlink-Induced Topic Search
HTML - Hypertext Markup Language
HTTP - Hypertext Transfer Protocol
LODAP - Log Data Processor
MVE - Minimum Volume Ellipsoid
MSN - Microsoft Network
PEBL - Positive Example Based Learning
SVM - Support Vector Machine
URI - Uniform Resource Identifier
URL - Uniform Resource Locator
WCM - Web Content Mining
WSM - Web Structure Mining
WWW - World Wide Web
XHTML - Extensible Hypertext Markup Language