Web Applications: Internet Search and Digital Preservation

Similar documents
Information Retrieval

Searching the Web for Information

deseo: Combating Search-Result Poisoning Yu USF

Information Retrieval May 15. Web retrieval

Information Retrieval Spring Web retrieval

How Does a Search Engine Work? Part 1

Mining Web Data. Lijun Zhang

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Search Engine Optimization (SEO)

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali

THE WEB SEARCH ENGINE

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Cloak of Visibility. -Detecting When Machines Browse A Different Web. Zhe Zhao

This tutorial has been prepared for beginners to help them understand the simple but effective SEO characteristics.

Mining Web Data. Lijun Zhang

On the Change in Archivability of Websites Over Time

Glossary of on line marketing terms

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0

The PageRank Citation Ranking: Bringing Order to the Web

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Chapter 2: Literature Review

Search Engines. Charles Severance

Ranking Techniques in Search Engines

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

Information Networks. Hacettepe University Department of Information Management DOK 422: Information Networks

CS47300 Web Information Search and Management

SEO 1 8 O C T O B E R 1 7

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

URLs excluded by REP may still appear in a search engine index.

Python & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012

Informa(on Retrieval

WEBSITES PUBLISHING. Website is published by uploading files on the remote server which is provided by the hosting company.

Lecture 8: Linkage algorithms and web search

SEO According to Google

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Search Engine Optimization (SEO) & Your Online Success

Collection Building on the Web. Basic Algorithm

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Searching the Web What is this Page Known for? Luis De Alba

Information Retrieval

搜索引擎优化. Search Engine Optimization 赵卫东博士复旦大学软件学院

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

DATA MINING II - 1DL460. Spring 2014"

Why is Search Engine Optimisation (SEO) important?

What s an SEO Strategy With Out Social Media?

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Web Structure Mining using Link Analysis Algorithms

6 WAYS Google s First Page

Security 08. Black Hat Search Engine Optimisation. SIFT Pty Ltd Australia. Paul Theriault

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

SEO. Definitions/Acronyms. Definitions/Acronyms

CS101 Lecture 30: How Search Works and searching algorithms.

An Archetype for Web Mining with Enhanced Topical Crawler and Link-Spam Trapper

Lec 8: Adaptive Information Retrieval 2

Blackhat Search Engine Optimization Techniques (SEO) and Counter Measures

Constructing Websites toward High Ranking Using Search Engine Optimization SEO

The Insanely Powerful 2018 SEO Checklist

Information Retrieval and Web Search

Objective Explain concepts used to create websites.

PROJECT REPORT (Final Year Project ) Project Supervisor Mrs. Shikha Mehta

Searching the Web [Arasu 01]

Introduction to Information Retrieval

Top-To-Bottom (And Beyond) On-Page Optimization Guidebook

1 GOOGLE NOW ALLOWS EDITING BUSINESS LISTINGS

A Survey on Web Information Retrieval Technologies

Effective Page Refresh Policies for Web Crawlers

Promoting Website CS 4640 Programming Languages for Web Applications

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Review of Meltmethod.com

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy

Business Forum Mid Devon. Optimising your place on search engines

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

SEOHUNK INTERNATIONAL D-62, Basundhara Apt., Naharkanta, Hanspal, Bhubaneswar, India

FILTERING OF URLS USING WEBCRAWLER

Review of Wordpresskingdom.com

How to Get Your Web Maps to the Top of Google Search

SEO News. 15 SEO Fixes for Better Rankings. For SEO, marketing books and guides, visit

power up your business SEO (SEARCH ENGINE OPTIMISATION)

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

10 SEO MISTAKES TO AVOID

12. Web Spidering. These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.

Table of contents. 1. Backlink Audit Summary...3. Marketer s Center. 2. Site Auditor Summary Social Audit Summary...9

Visualizing Thumbnails Of Archived Web Pages

Abhishek Dixit, Mukesh Agarwal

Below, we will walk through the three main elements of the algorithm, which include Domain Attributes, On-Page and Off-Page factors.

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].

The Geeks Guide To SEO

Searching in All the Right Places. How Is Information Organized? Chapter 5: Searching for Truth: Locating Information on the WWW

SEO basics. Chris

Review of Cormart-nigeria.com

Transcription:

CS 312 Internet Concepts Web Applications: Internet Search and Digital Preservation Dr. Michele Weigle Department of Computer Science Old Dominion University mweigle@cs.odu.edu http://www.cs.odu.edu/~mweigle/cs312-f11/ 1 Outline!!Generic Searching!!How Google Works!!Digital Preservation»! Internet Archive»! Memento»! Archive Facebook 2

Generic Search Engines!!Some very popular ones»! Google»! Bing»! Yahoo!!!Some common characteristics»! Huge set of results»! Extremely prompt retrieval»! Organized retrieval results How do they achieve such amazing results? 3 How Do Search Engines Work?!!Prep work»! crawling an automated way of browsing the web for pages»! archiving, analyzing, organizing, indexing!!retrieval»! query matching»! ranking search results»! displaying search results 4

Search Query Syntax!!The space between keywords is interpreted as AND for some search engines, but as OR for others.!!query: I'm interested in cancer in adults.»! Boolean logic: AND»! Search query: +cancer +adults!!query: I'm interested in radiation, but not nuclear.»! Boolean logic: NOT»! Search query: radiation -nuclear 5 Outline!!Generic Searching!!How Google Works!!Digital Preservation»! Internet Archive»! Memento»! Archive Facebook 6

Google!s Three Main Components!!Crawling!!Indexing!!Page Weighting http://ppcblog.com/how-google-works/ 7 Google!s Crawler, the Googlebot!!Consists of many computers requesting and fetching pages»! Googlebot can request thousands of different pages simultaneously!!when Googlebot fetches a page, it adds all of the links in the page to a queue for subsequent crawling (deep crawling)»! Googlebot can quickly build a list of links that can cover broad reaches of the web!!google continuously recrawls popular frequently changing web pages at a rate roughly proportional to how often the pages change (fresh crawls) 8

Submitting URLs to Googlebot!!http://www.google.com/addurl.html!!Rejects URLs that Google suspects are trying to deceive users»! including hidden text or links on a page»! stuffing a page with irrelevant words»! cloaking (aka bait and switch)»! using sneaky redirects»! creating doorways, domains, or sub-domains with substantially similar content»! sending automated queries to Google»! linking to bad neighbors. 9 Google's Indexer!! Parses webpages discovered by the Googlebot!! Converts them into a set of hits (word occurrences).»! For each document in which the word appears, indexer stores the position of the word, font size, capitalization!! Creates an inverted index that is indexed by word pointing to documents 10

Google's Page Weighter, PageRank!!Assigns each webpage a relevancy score!!a webpage's PageRank depends upon»! frequency and location of keywords within the page»! how long the webpage has existed»! number of other webpages that link to the page!!most important factor is number of pages that link to the page»! essentially, PageRank is a vote, by all the other pages on the Web, about how important a page is http://computer.howstuffworks.com/google1.htm 11 PageRank Example PageRank: Number and importance of links pointing to you! http://www.mattcutts.com/blog/seo-for-bloggers/! Time 6:20-8:30! 12

Manipulating PageRank / Webspam!!Google bomb»! is created if many sites link to the page using the same anchor text (increases the page rank)»! ex: In 1999, a search for more evil than Satan himself resulted in the Microsoft homepage!!spamdexing»! deliberately modifying HTML pages to increase the chance of their being placed close to the beginning of search engine results 13 Manipulating PageRank / Webspam!!link doping»! embedding a large number of gratuitous hyperlinks on a website, in exchange for reciprocal links!!google Jacking (page hijacking)»! creating a rogue copy of a popular website which shows contents similar to the original to a web crawler, but redirects web surfers to unrelated or malicious websites. 14

How to Improve Your PageRank!!Search Engine Optimization (SEO)!!Straight from Google»! http://www.mattcutts.com/blog/seo-for-bloggers/»! keyword tool (time 17:04-18:18) "! http://adwords.google.com/select/keywordtoolexternal»! be relevant, but don t overdo it! (time 23:11-25:24) 15 Outline!!Generic Searching!!How Google Works!!Digital Preservation»! Internet Archive»! Memento»! Archive Facebook 16

Digital Preservation!!Much of the record of our lives is digital.»! How do we keep these records?»! How do we walk down memory lane?!!backup strategies!!recording and archiving the web!!accessing past versions of webpages Web Science and Digital Libraries Research Group at ODU (http://ws-dl.blogspot.com/) 17 Internet Archive!!Internet Archive»! non-profit that was founded to build an Internet library, with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format»! http://www.archive.org!!internet Archive!s Wayback Machine»! allows you to browse through 85 billion web pages archived from 1996 to a few months ago»! http://www.archive.org/web/web.php 18

Memento!!Time Travel for the Web»! http://www.mementoweb.org/!!mementofox Firefox add-on»! https://addons.mozilla.org/en-us/firefox/addon/ 100298/ Demo!!Uses Internet Archive and others to present a webpage as it existed at a certain date/time Web Science and Digital Libraries Research Group at ODU (http://ws-dl.blogspot.com/) 19 Archive Facebook!!Grab a stand-alone archive of your Facebook account!!archive everything you want instead of what Facebook wants!!preserves FB look-and-feel!!http://bit.ly/archivefb http://www.slideshare.net/matkelly01/ndiippndsa-2011-archive-facebook! 20

Archive Facebook Content Dump versus WYSIWYG 21 Outline!!Generic Searching!!How Google Works!!Digital Preservation»! Internet Archive»! Memento»! Archive Facebook 22