DATA MINING - 1DL105, 1DL111

Similar documents
DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2017

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Web Mining Team 11 Professor Anita Wasilewska CSE 634 : Data Mining Concepts and Techniques

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

COMP5331: Knowledge Discovery and Data Mining

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Bruno Martins. 1 st Semester 2012/2013

Searching the Deep Web

Approaches to Mining the Web

Part I: Data Mining Foundations

Searching the Deep Web

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Information Retrieval Spring Web retrieval

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Web Structure Mining using Link Analysis Algorithms

Title: Artificial Intelligence: an illustration of one approach.

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Information Retrieval and Web Search

Competitive Intelligence and Web Mining:

Overview of Web Mining Techniques and its Application towards Web

CHAPTER - 3 PREPROCESSING OF WEB USAGE DATA FOR LOG ANALYSIS

DATA MINING II - 1DL460

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Web Crawling As Nonlinear Dynamics

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

INTRODUCTION. Chapter GENERAL

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Search Engines. Information Retrieval in Practice

Searching the Web What is this Page Known for? Luis De Alba

Today we show how a search engine works

Information Retrieval

Information Retrieval May 15. Web retrieval

DATA MINING II - 1DL460

SEARCH ENGINE INSIDE OUT

A Review Paper on Web Usage Mining and Pattern Discovery

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Creating a Classifier for a Focused Web Crawler

Abstract. 1. Introduction

Information Retrieval

Improving Relevance Prediction for Focused Web Crawlers

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

CS6200 Information Retreival. The WebGraph. July 13, 2015

Web Crawlers Detection. Yomna ElRashidy

An Introduction to Search Engines and Web Navigation

Web Usage Mining using ART Neural Network. Abstract

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

The Topic Specific Search Engine

Information Retrieval

Final Exam DATA MINING I - 1DL360

Chapter 6: Information Retrieval and Web Search. An introduction

CHAPTER-27 Mining the World Wide Web

Collection Building on the Web. Basic Algorithm

Introduction to Data Mining

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Chapter 27 Introduction to Information Retrieval and Web Search

DATA MINING II - 1DL460

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

THE STUDY OF WEB MINING - A SURVEY

Web Data mining-a Research area in Web usage mining

Introduction to Data Mining and Data Analytics

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Data warehousing and Phases used in Internet Mining Jitender Ahlawat 1, Joni Birla 2, Mohit Yadav 3

Web Usage Mining: A Research Area in Web Mining

Design and Implementation of A Web Mining Research Support System. A Proposal. Submitted to the Graduate School. of the University of Notre Dame

A Survey on Web Information Retrieval Technologies

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface

Information Retrieval II

Lecture 8: Linkage algorithms and web search

Finding Neighbor Communities in the Web using Inter-Site Graph

Pre-processing of Web Logs for Mining World Wide Web Browsing Patterns

Module 1: Internet Basics for Web Development (II)

Information Retrieval. Lecture 10 - Web crawling

Analysis of Link Algorithms for Web Mining

Big Data Analytics CSCI 4030

Detecting Spam Web Pages

Information Retrieval CSCI

Link Analysis in Web Mining

Seek and Ye shall Find

An Adaptive Approach in Web Search Algorithm

Pattern Classification based on Web Usage Mining using Neural Network Technique

FILTERING OF URLS USING WEBCRAWLER

Ranking Algorithms based on Links and Contentsfor Search Engine: A Review

Link Analysis and Web Search

THE WEB SEARCH ENGINE

Survey on Web Structure Mining

Transcription:

1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala University, Uppsala, Sweden

2 Introduction to Data Mining: Web Mining (slides + supplemental articles) ref book (used for slides): Data mining / Dunham Kjell Orsborn Department of Information Technology Uppsala University, Uppsala, Sweden

3 Web Mining Outline Goal: Examine the use of data mining on the World Wide Web Introduction Web Content Mining Web Structure Mining Web Usage Mining

4 Web Mining Issues Size >350 million pages (1999) Grows at about 1 million pages a day Google indexes 3 billion documents More recent figures: According to a 2001 study, there were more than 550 billion documents (approximately 7,500 terabytes of data) on the Web, mostly in the "invisible web", or deep web. A study, dated January 2005, queried the Google, MSN, Yahoo!, and Ask Jeeves search engines with search terms from 75 different languages and determined that there were over 11.5 billion web pages in the publicly indexable Web, also termed the the surface web. Diverse types of data

5 Web data Web pages Intra-page structures Inter-page structures Usage data Supplemental data Profiles Registration information Cookies

Web Mining Taxonomy 6 Modified from [zai01]

7 Web content mining Extends work of basic search engines Search engines IR application Crawlers Indexing Profiles Link analysis Text mining functions (from basic to advanced) Keyword Term associations Similarity search (between query and document) Classification and clustering Natural language processing

8 Crawlers Robot (spider) traverses the hypertext structure in the Web. Collect information from visited pages Used to construct indexes for search engines Traditional crawler visits entire Web (?) and replaces index Periodic crawler visits portions of the Web and updates subset of index Incremental crawler selectively searches the Web and incrementally modifies index Focused crawler visits pages related to a particular subject

9 Focused crawler Only visit links from a page if that page is determined to be relevant. Classifier is static after learning phase. Components: Classifier which assigns relevance score to each page based on crawl topic. Distiller to identify hub pages. Crawler visits pages to based on crawler and distiller scores. Classifier to related documents to topics Classifier also determines how useful outgoing links are Hub Pages contain links to many relevant pages. Must be visited even if not high relevance score.

10 Focused crawler

11 Context focused crawler Context Graph: Context graph created for each seed document. Root is the seed document. Nodes at each level show documents with links to documents at next higher level. Updated during crawl itself. Approach: Construct context graph and classifiers using seed documents as training data. Perform crawling using classifiers and context graph created.

12 Context graph

13 Virtual web view Multiple Layered DataBase (MLDB) built on top of the Web. Each layer of the database is more generalized (and smaller) and centralized than the one beneath it. Upper layers of MLDB are structured and can be accessed with SQL type queries. Translation tools convert Web documents to XML. Extraction tools extract desired information to place in first layer of MLDB. Higher levels contain more summarized data obtained through generalizations of the lower levels.

14 Personalization Web access or contents tuned to better fit the desires of each user. Manual techniques identify user s preferences based on profiles or demographics. Collaborative filtering identifies preferences based on ratings from similar users. Content based filtering retrieves pages based on similarity between pages and user profiles.

15 Web structure mining Mine structure (links, graph) of the Web Techniques PageRank CLEVER HITS Create a model of the Web organization. May be combined with content mining to more effectively retrieve important pages.

16 PageRank Used by Google Prioritize pages returned from search by looking at Web structure. Importance of page is calculated based on number of pages which point to it Backlinks. Weighting is used to provide more importance to backlinks coming form important pages. PR(p) = c (PR(1)/N 1 + + PR(n)/N n ) PR(i): PageRank for a page i which points to target page p. N i : number of links coming out of page i c: is a value between 0 and 1 used for normalization

17 CLEVER Identify authoritative and hub pages. Authoritative Pages : Highly important pages. Best source for requested information. Hub Pages : Contain links to highly important pages.

18 HITS Hyperlink-Induced Topic Search Based on a set of keywords, find set of relevant pages R. Identify hub and authority pages for these. Expand R to a base set, B, of pages linked to or from R. Calculate weights for authorities and hubs. Pages with highest ranks in R are returned.

19 HITS algorithm

20 Web usage mining Performs mining on web usage data or web logs (clickstreams) Examined both from a server Uncover info about site where service reside Can e.g. improve design... and a client perspective Uncovers info about user or group Can e.g. improve prefeching and caching Applications of web usage mining Personalization Improve structure of a site s Web pages Aid in caching and prediction of future page references Improve design of individual pages Improve effectiveness of e-commerce (sales and advertising)

21 Web usage mining activities Preprocessing Web log Cleanse Remove extraneous information Sessionize Session: Sequence of pages referenced by one user at a sitting. Pattern Discovery Count patterns that occur in sessions Pattern is sequence of pages references in session. Similar to association rules Transaction: session Itemset: pattern (or subset) Order is important Pattern Analysis

22 Web Mining: Content Structure Usage Association analysis in web mining Frequent patterns of sequential page references in Web searching. Uses: Caching Clustering users Develop user profiles Identify important pages

23 Web usage mining issues Identification of exact user not possible. Exact sequence of pages referenced by a user not possible due to caching. Session not well defined Security, privacy, and legal issues

24 Web log cleansing Replace source IP address with unique but non-identifying ID. Replace exact URL of pages referenced with unique but nonidentifying ID. Delete error records and records containing not page data (such as figures and code)

25 Sessionizing Divide Web log into sessions. Two common techniques: Number of consecutive page references from a source IP address occurring within a predefined time interval, such as 30 min (empirical studies show 25,5 min). All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.

26 Data structures Keep track of patterns identified during Web usage mining process Common techniques: Trie Suffix Tree Generalized Suffix Tree WAP Tree

27 Trie vs. Suffix tree Trie: Rooted tree Edges labeled which character (page) from pattern Path from root to leaf represents pattern. Suffix tree: Single child collapsed with parent. Edge contains labels of both prior edges.

28 Trie and Suffix tree

29 Generalized Suffix tree Suffix tree for multiple sessions. Contains patterns from all sessions. Maintains count of frequency of occurrence of a pattern in the node. WAP Tree: Compressed version of generalized suffix tree

30 Types of patterns Algorithms have been developed to discover different types of patterns. Properties: Ordered Pages (characters) must occur in the exact order in the original session. Duplicates Duplicate pages are allowed in the pattern. Consecutive All pages in pattern must occur consecutive in given session. Maximal Not subsequence of another pattern.

31 Pattern types Association rules None of the properties hold (no order, no duplicates, no consecutive or maximal patterns) Episodes Only ordering holds Sequential patterns Ordered and maximal Forward sequences Backlinks and reloads eliminated Ordered, consecutive, and maximal Maximal frequent sequences Support calculated in reference to length of sequence, i.e. no of clicks All properties hold

32 Episodes Partially ordered set of pages Serial episode totally ordered with time constraint Parallel episode partial ordered with time constraint General episode partial ordered with no time constraint

33 DAG for Episode