DATA MINING II - 1DL460. Spring 2017

Similar documents
DATA MINING II - 1DL460. Spring 2014"

DATA MINING - 1DL105, 1DL111

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Mining Web Data. Lijun Zhang

Web Mining Team 11 Professor Anita Wasilewska CSE 634 : Data Mining Concepts and Techniques

Information Retrieval Spring Web retrieval

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Information Retrieval May 15. Web retrieval

Overview of Web Mining Techniques and its Application towards Web

Mining Web Data. Lijun Zhang

Collection Building on the Web. Basic Algorithm

CHAPTER - 3 PREPROCESSING OF WEB USAGE DATA FOR LOG ANALYSIS

INTRODUCTION. Chapter GENERAL

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

An Approach To Web Content Mining

Information Retrieval

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Competitive Intelligence and Web Mining:

FILTERING OF URLS USING WEBCRAWLER

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Information Retrieval and Web Search

Part I: Data Mining Foundations

Approaches to Mining the Web

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

DATA MINING II - 1DL460

COMP5331: Knowledge Discovery and Data Mining

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Information Retrieval

Web Usage Mining using ART Neural Network. Abstract

Web Crawlers Detection. Yomna ElRashidy

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Semantic Web Lecture Part 1. Prof. Do van Thanh

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

Search Engines. Information Retrieval in Practice

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Searching the Deep Web

Nitin Cyriac et al, Int.J.Computer Technology & Applications,Vol 5 (1), WEB PERSONALIZATION

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

CS6200 Information Retreival. Crawling. June 10, 2015

SEARCH ENGINE INSIDE OUT

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Web Structure Mining using Link Analysis Algorithms

Semantic Clickstream Mining

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

DATA MINING II - 1DL460

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report

Module 1: Internet Basics for Web Development (II)

Web Usage Mining: A Research Area in Web Mining

Chapter 2 BACKGROUND OF WEB MINING

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Information Retrieval

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA

Searching the Deep Web

Chapter 3 Process of Web Usage Mining

Adaptive and Personalized System for Semantic Web Mining

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

History and Backgound: Internet & Web 2.0

Proposal for Implementing Linked Open Data on Libraries Catalogue

A Review Paper on Web Usage Mining and Pattern Discovery

12 Web Usage Mining. With Bamshad Mobasher and Olfa Nasraoui

An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery

Pre-processing of Web Logs for Mining World Wide Web Browsing Patterns

WebGUI & the Semantic Web. William McKee WebGUI Users Conference 2009

Searching the Web What is this Page Known for? Luis De Alba

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

Creating a Classifier for a Focused Web Crawler

CS6200 Information Retreival. The WebGraph. July 13, 2015

How Does a Search Engine Work? Part 1

Chapter 27 Introduction to Information Retrieval and Web Search

Web Usage Mining for Web Personalization

Web Crawling As Nonlinear Dynamics

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

AN EFFECTIVE SEARCH ON WEB LOG FROM MOST POPULAR DOWNLOADED CONTENT

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

World Wide Web has specific challenges and opportunities

Web Data mining-a Research area in Web usage mining

The Data Web and Linked Data.

LIST OF ACRONYMS & ABBREVIATIONS

CHAPTER-27 Mining the World Wide Web

Unit 4 The Web. Computer Concepts Unit Contents. 4 Web Overview. 4 Section A: Web Basics. 4 Evolution

Seek and Ye shall Find

Link Analysis and Web Search

Seek and Ye shall Find

EFFECTIVELY USER PATTERN DISCOVER AND CLASSIFICATION FROM WEB LOG DATABASE

Survey Paper on Web Usage Mining for Web Personalization

Semantic-Based Web Mining Under the Framework of Agent

Deep Web Crawling and Mining for Building Advanced Search Application

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Finding Neighbor Communities in the Web using Inter-Site Graph

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015

A Survey on Web Personalization of Web Usage Mining

Lecture 8: Linkage algorithms and web search

Title: Artificial Intelligence: an illustration of one approach.

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

THE STUDY OF WEB MINING - A SURVEY

Transcription:

DATA MINING II - 1DL460 Spring 2017 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala University, Uppsala, Sweden 17-02-09 1

Introduction to Data Mining: Web Mining (slides + supplemental articles) ref book (used for slides): Data mining / Dunham Kjell Orsborn Department of Information Technology Uppsala University, Uppsala, Sweden 17-02-09 2

Web Mining Outline Goal: Examine the use of data mining on the World Wide Web Introduction Web Content Mining Web Structure Mining Web Usage Mining 17-02-09 3

Web Mining Issues Size >350 million pages (1999) Grows at about 1 million pages a day Google indexes 3 billion documents More recent figures: According to a 2001 study, there were more than 550 billion documents (approximately 7,500 terabytes of data) on the Web, mostly in the "invisible web", or deep web. A study, dated January 2005, queried the Google, MSN, Yahoo!, and Ask Jeeves search engines with search terms from 75 different languages and determined that there were over 11.5 billion web pages in the publicly indexable Web, also termed the the surface web. >25 billion pages (2009) in the indexable web (Worldwidewebsize.com) >45 billion pages (2017) in the indexable web (Worldwidewebsize.com) Estimates say size of deep web 400-500 times bigger than surface web Diverse types of data 17-02-09 4

Web data Web pages Intra-page structures Inter-page structures Usage data Supplemental data Profiles Registration information Cookies 17-02-09 5

Web Mining Taxonomy Modified from [zai01] 17-02-09 6

Web content mining Extends work of basic search engines Search engines IR application Crawlers Indexing Profiles Link analysis Text mining functions (from basic to advanced) Keyword Term associations Similarity search (between query and document) Classification and clustering Natural language processing 17-02-09 7

Web Crawlers Robot (spider) traverses the hypertext structure in the Web. Collect information from visited pages Used to construct indexes for search engines Traditional crawler visits entire Web(?) and replaces index Periodic crawler visits portions of the Web and updates subset of index Incremental crawler selectively searches the Web and incrementally modifies index Focused crawler visits pages related to a particular subject 17-02-09 8

Web crawler policies Web crawler behavior is the result of a combination of policies: a selection policy that states which pages to download, a re-visit policy that states when to check for changes to the pages, a politeness policy that states how to avoid overloading web sites, and a parallelization policy that states how to coordinate distributed web crawlers. 17-02-09 9

Web crawler applications Web search engines Google, Microsoft Bing, (Yahoo), Baidoo, Navier, Yandex, Ask,... One of three base technologies: crawling, indexing and querying (include ranking) I What are the other two and which is the most crucial 17-02-09 10

Web archiving Digital preservation Librarian look on the Web The biggest: Internet Archive Batch crawls Primarily collection of national websites Web crawler applications There are quite many and some are huge! (see the list of Web Archiving Initiatives at Wikipedia) Vertical search engines Data aggregating from many sources on certain topic E.g., apartment search, car search Web data mining To get data to be actually mined Usually using focused crawlers For example, opinion mining Or digests of current happenings on the Web (e.g. what music people listen to now) 17-02-09 11

Web crawler applications Web monitoring Monitoring sites/pages for changes and updates Detection of malicious web sites Typically a part of antivirus, firewall, search engine, etc. service Building a list of such web sites and inform a user about potential threat of visiting such Web site/application testing Crawl a web site to check a navigation through it, validity the links, etc. Regression/security/...testing a rich internet application (RIA) via crawling Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out) Fighting crime! :) well, copyright violations Crawl to find (media) items under copyright or links to them Regular re-visiting suspicious web sites, forums, etc. Tasks like finding terrorist chat rooms also go here 17-02-09 12

Web crawler applications Web scraping Extracting particular pieces of information from a group of typically similar pages Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection and web data integration. When API to data is not available Interestingly, scraping might be more preferable even with API available as scraped data can often be more clean and up-to-date than data-via-api Web mirroring Copying of web sites Often hosting copies on different servers to ensure constant accessibility 17-02-09 13

Example of crawling www.it.uu.se/research 17-02-09 14

Web crawler architecture 17-02-09 15

Basic sequential crawler Architecture of sequential crawler: Seeds list of starting URLs The order of page visits determined by frontier data structure Stop condition (e.g. X no of pages fetched) Illustration taken from Ch.8 Web Crawling by Filippo Menczer in Bing Liu s Web Data Mining (Springer, 2007) 17-02-09 16

Focused crawler Focused crawler (when focusing on specific subject also called topical crawler): Only visit links from a page if that page is determined to be relevant. Classifier is static after learning phase. Components: Classifier which assigns relevance score to each page based on crawl topic. Distiller to identify hub pages. Crawler visits pages based on crawler and distiller scores. Classifier relates documents to topics Classifier also determines how useful outgoing links are Hub Pages contain links to many relevant pages. Must be visited even if not high relevance score. 17-02-09 17

Focused crawler 17-02-09 18

Context focused crawler Context Graph: Context graph created for each seed document. Root is the seed document. Nodes at each level show documents with links to documents at next higher level. Updated during crawl itself. Approach: Construct context graph and classifiers using seed documents as training data. Perform crawling using classifiers and context graph created. 17-02-09 19

Context graph 17-02-09 20

Virtual web view Multiple Layered DataBase (MLDB) built on top of the Web. Each layer of the database is more generalized (and smaller) and centralized than the one beneath it. Upper layers of MLDB are structured and can be accessed with SQL type queries. Translation tools convert Web documents to XML. Extraction tools extract desired information to place in first layer of MLDB. Higher levels contain more summarized data obtained through generalizations of the lower levels. 17-02-09 21

Semantic web Semantic web extends the network of hyperlinked human-readable web pages by inserting machine-readable metadata about pages and how they are related to each other. Enables automated agents to access the Web more intelligently and perform more tasks on behalf of users. The term "Semantic Web" was coined by Tim Berners-Lee. He defines the Semantic Web as "a web of data that can be processed directly and indirectly by machines. Many technologies proposed by the W3C are used in various contexts, particularly dealing with information in a limited and defined domain, and where sharing data is a common necessity, such as scientific research or data exchange among businesses. 17-02-09 22

Semantic web The term "Semantic Web" is often used more specifically to refer to the formats and technologies that enable it. Collecting, structuring and searching of linked data are enabled by technologies that provide a formal description of concepts, terms, and relationships within a given knowledge domain. HTML describes documents and the links between them while RDF, OWL, and XML, can describe arbitrary things such as people, meetings, or airplane parts. These technologies are specified as W3C standards and include: Resource Description Framework (RDF), a general method for describing information RDF Schema (RDFS) SPARQL, an RDF query language An extended dialect for scientific applications called SCISPARQL developed at UDBL, Uppsala University) Web Ontology Language (OWL), a family of knowledge representation languages Extensible Markup Language (XML) 17-02-09 23

Personalization in web content mining Web access or contents tuned to better fit the desires of each user. E.g. personalized pages, search results and recommendations Include the use of cookies, user accounts, access patterns, databases and more complex data mining techniques Manual techniques perform personalization based on user s registered preferences or classification of individuals based on profiles or demographics. Collaborative filtering recommends information (pages, music, books, etc) by identifying preferences based on high ratings from similar users. Content based filtering retrieves pages based on similarity between pages and user profiles. 17-02-09 24

Web structure mining Mine structure (links, graph) of the indexable Web Techniques PageRank (Google) HITS (IBM Clever project) Create a model of the Web organization. May be combined with content mining to more effectively retrieve important pages. 17-02-09 25

PageRank Used by Google PageRank was designed to increase the effectiveness of search engines and improve their efficiency Prioritize pages returned from search by looking at Web structure. Importance of page is calculated based on number of pages which point to it i.e. backlinks. Weighting is used to provide more importance to backlinks coming form important pages. A simplified expression for calculating the PageRank (PR): PR(p) = PR(1)/N 1 + + PR(n)/N n PR(i): PageRank for a page i which points to target page p. N i : number of links coming out of page i 17-02-09 26

HITS - Clever Developed by IBM Clever project Aims at identifying both authoritative pages and hub pages. Authoritative Pages : Highly important pages. Best source for requested information. Hub Pages : Contain links to highly important pages, i.e. authoritative pages. 17-02-09 27

HITS - Clever Hyperlink-Induced Topic Search (HITS) Based on a given set of keywords (found by a query using a search engine, SE), find a set of relevant pages R (the root set). Identify hub and authority pages for these. Expand R to a base set, B, of pages linked to or from R. Calculate weights for authorities and hubs. Pages with highest ranks in R are returned. 17-02-09 28

HITS algorithm 17-02-09 29

Web usage mining Performs mining on web usage data or web logs (clickstreams) Examined both from a server Uncover info about site where service reside Can e.g. improve design... and a client perspective Uncovers info about user or group Can e.g. improve prefetching and caching Applications of web usage mining Personalization (by e.g. tracking browsing behaviour) Improve structure of a site s web pages Aid in caching and prediction of future page references Improve design of individual pages Improve effectiveness of e-commerce (sales, advertising and recommendations of variuos kinds) 17-02-09 30

Preprocessing Web usage mining activities e.g. web logs (reformatting web log data) Cleansing and remove extraneous information, path completion and formatting User identification Session identification - sessionize, where session is a sequence of pages referenced by one user at a sitting. Pattern discovery Count patterns that occur in sessions Pattern is sequence of pages referenced in a session. For example association rules Transaction: session Itemset: pattern (or subset) For other type of patterns order can also be important Pattern analysis Interpretation of results of pattern discovery 17-02-09 31

Web usage mining issues Identification of exact user not possible. Due to proxy servers, client side caching, firewalls, ISP:s Cookies may help Exact sequence of pages referenced by a user not possible due to caching. Path completion algorithm can be applied Session not well defined Security, privacy, and legal issues 17-02-09 32

Web log cleansing Replace source IP address with unique but non-identifying ID. Replace exact URL of pages referenced with unique but nonidentifying ID. Delete error records and records containing not page data (such as figures and code) 17-02-09 33

Sessionizing Divide Web log into sessions. Two common techniques: Number of consecutive page references from a source IP address occurring within a predefined time interval, such as 30 min (empirical studies show 25,5 min). All consecutive page references from a source IP address where the interclick time is less than a predefined threshold. 17-02-09 34

Data structures in web usage mining Keep track of patterns identified during Web usage mining process Common techniques: Trie Suffix Tree Generalized Suffix Tree WAP Tree 17-02-09 35

Trie vs. Suffix tree Trie: Rooted tree Edges labeled with character (page) from pattern Path from root to leaf represents pattern. Suffix tree: Single child collapsed with parent. Edge contains labels of both prior edges. 17-02-09 36

Trie and Suffix tree 17-02-09 37

Generalized suffix tree A generalized suffix tree is a suffix tree for multiple sessions. Contains patterns from all sessions. Maintains count of frequency of occurrence of a pattern in the node. WAP Tree: Compressed version of generalized suffix tree 17-02-09 38

Pattern discovery Most obvious activity to uncovering traversal patterns, i.e. a set of pages visited by a user in a session Association rules can look at pages accessed together without considering order Finding what pages are accessed together Similar traversal patterns can be clustered together to provide clustering of users in contrast to clustering pages to identify similar pages 17-02-09 39

Types of patterns Algorithms have been developed to discover different types of patterns. Properties: Ordered pages (characters) must occur in the exact order as in the original session. Duplicates duplicate pages are allowed in the pattern. Consecutive all pages in pattern must occur contiguous in a given session. Maximal not subsequence of another pattern. Applications: Prefetching and caching applications: knowledge of contiguous page references frequently made can be useful predicting future references Site/page design: knowledge of frequent backward traversals can be used to improve design Maximal property mainly used to reduce number of meaningful patterns 17-02-09 40

Pattern types Association rules None of the properties hold (no order, no duplicates, no consecutive or maximal patterns) Episodes Only ordering holds Sequential patterns (as applied in web usage mining) Ordered and maximal Forward sequences Backlinks and reloads eliminated Ordered, consecutive, and maximal Maximal frequent sequences Support calculated in reference to length of sequence, i.e. no of clicks All properties hold 17-02-09 41

Episodes Partially ordered set of pages Serial episode totally ordered with time constraint Parallel episode partial ordered with time constraint General episode partial ordered with no time constraint 17-02-09 42

DAG for Episode Temporal ordering of events (page visits) where numbers indicate time step between nodes 17-02-09 43

Comparing properties of traversal patterns transactions or sessions time w. is windows of certain size customers or users 17-02-09 44