: Semantic Web (2013 Fall)

Similar documents
Introduction to the Semantic Web

The Semantic Web Vision

60-538: Information Retrieval

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Extracting knowledge from Ontology using Jena for Semantic Web

Mining Web Data. Lijun Zhang

USC Viterbi School of Engineering

CSC 261/461 Database Systems. Fall 2017 MW 12:30 pm 1:45 pm CSB 601

SRM UNIVERSITY. : Batch1: TP1102 Batch2: TP406

Jumpstarting the Semantic Web

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

CS47300: Web Information Search and Management

Databases 2 (VU) ( )

COMP9321 Web Application Engineering

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

COMP9321 Web Application Engineering

Semantic Website Clustering

Internet Client-Server Systems 4020 A

Oleksandr Kuzomin, Bohdan Tkachenko

INTRODUCTION. Chapter GENERAL

21. Search Models and UIs for IR

Introduction to Databases Fall-Winter 2009/10. Syllabus

The Data Web and Linked Data.

Information Retrieval CS6200. Jesse Anderton College of Computer and Information Science Northeastern University

CS506/606 - Topics in Information Retrieval

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

Exam IST 441 Spring 2014

Introduction to Text Mining. Hongning Wang

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Information Retrieval. Lecture 9 - Web search basics

Knowledge Representation and Semantic Web

Knowledge Representation and Semantic Web

AutoFocus, an Open Source Facet-Driven Enterprise Search Solution

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Fall 2017 ECEN Special Topics in Data Mining and Analysis

Information Retrieval and Web Search

Welcome to INFO216: Advanced Modelling

Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark

Semantic Web Applications and the Semantic Web in 10 Years. Based on work of Grigoris Antoniou, Frank van Harmelen

Proposal for Implementing Linked Open Data on Libraries Catalogue

Research Topics in Information Retrieval

Introduction to Data Mining

Information Retrieval Spring Web retrieval

CAS CS 460/660 Introduction to Database Systems. Fall

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword.

CS290N Summary Tao Yang

Part I: Data Mining Foundations

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

INFS 2150 (Section A) Fall 2018

Introduction & Administrivia

Towards the Semantic Desktop. Dr. Øyvind Hanssen University Library of Tromsø

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Improving Adaptive Hypermedia by Adding Semantics

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Semantic Web Technologies

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016

INF 315E Introduction to Databases School of Information Fall 2015

Social Networks 2015 Lecture 10: The structure of the web and link analysis

CS50 Quiz Review. November 13, 2017

Beyond Content-Based Retrieval:

Web Search Basics. Berlin Chen Department t of Computer Science & Information Engineering National Taiwan Normal University

CS-490WIR Web Information Retrieval and Management. Luo Si

Lecture 27: Learning from relational data

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Introduction to Semantic Web

Introduction to Databases Fall-Winter 2010/11. Syllabus

Introduction to Information Retrieval. Hongning Wang

Chapter 6: Information Retrieval and Web Search. An introduction

Semantic Web Systems Introduction Jacques Fleuriot School of Informatics

CS 572: Information Retrieval. Lecture 1: Course Overview and Introduction 11 January 2016

Prof. Dr. Christian Bizer

Introduction to Information Retrieval

Chapter 27 Introduction to Information Retrieval and Web Search

DATA MINING II - 1DL460. Spring 2017

Introduction. October 5, Petr Křemen Introduction October 5, / 31

Information Retrieval

Knowledge Discovery and Data Mining 1 (KU)

KNOW At The Social Book Search Lab 2016 Suggestion Track

Using the Penn State Search Engine

A Formal Definition of RESTful Semantic Web Services. Antonio Garrote Hernández María N. Moreno García

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Introduction to Bioinformatics

Introduction to Bioinformatics

Information Retrieval. Lecture 10 - Web crawling

Google indexed 3,3 billion of pages. Google s index contains 8,1 billion of websites

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

Semantic-Based Web Mining Under the Framework of Agent

Welcome to INFO216: Advanced Modelling

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

p. 2 Copyright Notice Legal Notice All rights reserved. You may NOT distribute or sell this report or modify it in any way.

How A Website Works. - Shobha

Web Information System Design. Tatsuya Hagino

Lecture 0: Overview of cs1106/cs6503

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Introduction to CS 4604

COMP6217 Social Networking Technologies Web evolution and the Social Semantic Web. Dr Thanassis Tiropanis

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

Mining Social Network Graphs

Transcription:

03-60-569: Web (2013 Fall) University of Windsor September 4, 2013

Table of contents 1 2 3 4 5

Definition of the Web The World Wide Web is a system of interlinked hypertext documents accessed via the Internet The web relies on three mechanisms to make these resources available to audience: A uniform naming for locating resources on the web (e.g., URIs). Protocols, for access to named resources over the web (e.g., HTTP). Hypertext, for easy navigation among resources (e.g., HTML). What we use todays web for Browse (e.g., our course web site) Search (e.g., Google) Online social networking (facebook, twitter,...)...

Internet and the Web Internet: Global system of interconnected computer networks that use the standard Internet Protocol Suite (TCP/IP) Internet is a more general term Includes physical aspect of underlying networks and mechanisms such as email, FTP, HTTP Web: Associated with information stored on the Internet Refers to a broader class of networks, i.e. Web of English Literature Both Internet and web are networks 1 2 6 2 3 3 5 4 3 1 1 1

What are the challenging tasks for the Web Search Engine Crawling: collect the web pages; Indexing: There are tools that help us index documents, and provide relevant answers to queries. There are many matches. What matters are the top a few pages Ranking: Important pages are returned first. How to decide the importance? Incoming links. Important link weights more. How to decide the importance? Detect similar pages How do we detect similar pages (what do we mean by similar page)? When there are billions of pages, what is the algorithm? Online Social Networks. What are the network structure. E.g., the clustering coefficient, how many triangles are there in the network. How to handle the semantics Latent Indexing

Web characteristics 10 5 10 4 Graph structure of the Web Power laws, small world, bow-tie structure Users and queries (10)Epinion 10 3 10 2 10 1 10 0 10 0 10 1 10 2 10 3 10 4

Web crawling The first step to build a search engine Crawler architecture and policies, breadth first and depth first crawling; Crawling OSNs and Deep web Near duplicate detection Detect similar pages among vast number of documents Have to be efficient Use shingles to represent pages Jaccard similarity

Search Engines The process to build a search engine Use cene search engine API Page Rank Algorithm The algorithm used by Google and other search engines; Have many engineering challenges

Problem with search engined: High recall, low precision Searching for semantic web in google matches estimated. 7,610,000 pages in 2009 14,000,000 pages in 2011 Only first a few pages will be viewed Need to be intelligent Results are highly sensitive to vocabulary If you search for faculty member, you may not be able to find professor. Results are single web pages You will not be able to find professors who are teaching semantic web in Canada Current search engines can not integrate related web pages in different places

Approach One: Latent Indexing Words used in the same context tend to have similar meanings Capture the semantic meanings that are otherwise latent in text Use SVD technique (Singular Value Decomposition) from linear algebra Ideas developed in the seventies, applied widely in information retrieval

Approach Two: Represent web content in a form that is more machine-processable. Use intelligent techniques to take advantage of these representations. Related technologies Explicit Metadata (XML) Ontologies Logic and Inference Web service Data integration Intelligent software agents

HTML <h1> A g i l i t a s Physiotherapy Centre </h1> Welcome to the home page of the A g i l i t a s Physiotherapy Centre. Do yo K e l l y Townsend ( our l o v e l y s e c r e t a r y ) and Steve Matthews take care o <h2> C o n s u l t a t i o n hours </h2> Mon 11am 7pm <br> Tue 11am 7pm <br> Wed 3pm 7pm <br> Thu 11am 7pm <br> F r i 11am 3pm <p> But note t h a t we do not o f f e r c o n s u l t a t i o n during the weeks of the <a h r e f =... > State Of O r i g i n </a> games. Markups are for formatting or presentation. Web contents are for humans instead of programs

Problems with HTML Humans have no problem with this. Machines (software agents) do. How to distinguish therapists from the secretary; How to determine exact consultation hours; They would have to follow the link to the State Of Origin games to find when they take place. One approach is to build intelligent program to extract the relevant information

A better representation-xml Explicit Metadata This representation is far more easily processable by machines Metadata: data about data. Metadata capture part of the meaning of data does not rely on text-based manipulation, but rather on machine-processable metadata

Meta data is not enough E.g., how can machine know that < secretary > and < therapist > are people? We need to define that < secretary > is a subclass of < people > How can we specify the rule that < secretary > can not be a < therapist >? We need to say that classes < secretary > and < therapist > are disjoint.

Topics Web as a graph, search basics Web analysis, degrees, diameter, scale-free network, near duplicate detection, page rank algorithm, Mining online social network Introduction, RDF, RDFS, OWL, ontology engineering;

Final Exam: 40% Presentation: 10% Class participation: 10% One project 40% Paper presentation 10% (before week 7) Papers and approximate presentation time are listed on course web site. When you are ready to present, let me know the time and the paper you selected. When your presentation slides is ready, send me the slides. Each presentation is 30 minutes long. Class participation10% Active in classes, paper presentation

Project There are five small tasks Build a crawler to collect web pages (8%) Say all the pages in the domain of uwindsor.ca Construct a search engine using cene (8%) Compute page rank (8%) Find near duplicate pages (8%) Compute clustering coefficient (8%) Lectures on these topics are finished within the first a few weeks, so that you can start the project soon; Naive implementation is straightforward. You can use any programming language you prefer. The submission deadline for the first two is November 10. The due date for the last two subtasks are on the last day of the class. You need to submit the source code as well as the ppt slides to be presented in the class. On the last day of the class, you will demonstrate and explain your project in front of the class. You can get high marks only if your program can process big data. Student presentations can address the details on these algorithms, especially the scalability techniques.

Tentative Schedule Week 1-4: Web as a graph, Crawling, Search Engine construction, paper presentations. Week 5-7: Link analysis (page rank algorithm), near duplicate detection, paper presentations. Week 8-10: (RDF, RDFS, OWL, query language) Week 11-12: Latent semantic (Eigenvectors, singular value decomposition), Exam Web graph basics,, Web and OSN mining algorithms. Page rank, similar pages, singular value decomposition, latent semantic. Test topics covered in lectures Test issues involved in projects Concepts, Logics, and Algorithms Student presentations will not be tested

Reading materials MMD Anand Rajaraman and Jeff Ullman, Mining of massive datasets, 2012. IR Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze,Introduction to Information Retrieval, Cambridge University Press. 2008. Chapters 18, 19, 20, 21 only. SW Grigoris Antoniou, Frank van Harmelen,A Web Primer, MIT Press