Search Engine Technology. Mansooreh Jalalyazdi

Search Engines
Search engines are programs that people use to find information by typing in keywords. The search engine returns a list of sites it considers relevant, based on the particular way it indexes millions of sites.

Some Search Engines
http://www.google.com
http://www.altavista.com
http://www.askjeeves.com
http://www.alltheweb.com
http://www.inktomi.com
http://www.lycos.com
http://www.yahoo.com
http://www.msn.com
http://www.aol.com
http://www.netscape.com
http://www.excite.com
http://www.infoseek.com
http://www.looksmart.com
http://www.firstgov.com
http://www.wisenut.com
http://www.hotbot.com
http://www.go.com
http://www.mamma.com
http://www.northernlight.com
http://www.dmoz.org
http://www.snap.com
http://www.overture.com
http://www.webcrawler.com
http://www.metacrawler.com
http://www.metaseek.com
http://www.dogpile.com
http://www.ixquick.com
http://www.vivisimo.com
http://www.sherlockhound.com
http://www.inquirus.com

Mooter SE

Ranking Importance

Search Engine Types
Crawler-based (active search engine): Google, AlltheWeb
Human-powered directories (passive search engine): Yahoo, LookSmart
Meta-crawlers (meta-search engine): AskJeeves
Most search engines pull results from some or all of these sources, but one type normally dominates.

Some Other Search Engine Types
News search engines
Multimedia search engines
Metacrawlers
Kids search engines
Regional search engines

Parts of a Crawler-Based Search Engine
Spider (crawler)
Index (catalog)
Search engine software
  Sifts through the index
  Ranks the matches

Web spiders are simplistic
They start with a given URL and grab that page.
They find all the links in the page.
They follow each of those links, retrieving a document.
They repeat this process until something tells them to stop.
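The spider loop described above can be sketched in a few lines. This is a minimal illustration, not any particular engine's crawler: the names `LinkExtractor`, `crawl`, `fetch`, `max_pages`, and the in-memory `PAGES` site are all assumptions made for the example, and the "something tells them to stop" condition is modeled as a simple page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)          # grab the page
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)          # find all the links in the page
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # follow each link later
    return visited


# Hypothetical three-page site held in memory instead of fetched over HTTP.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

print(crawl("http://example.com/", PAGES.get))
```

Passing `fetch` as a parameter keeps the sketch runnable offline; a real spider would substitute an HTTP client and would also honor robots.txt.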

What Spiders (Robots) Do

Architecture of a Meta-search
[Diagram. Components: User, User Interface, Query Dispatcher, SE 1, SE 2, SE 3, Web, Display, Feedback, Knowledge, Personalize]
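As a rough illustration of the query dispatcher in this architecture, the sketch below sends one query to several engines and merges their ranked lists. The slide does not say how results are merged, so the reciprocal-rank scoring here is an assumption, as are the names `metasearch`, `se1`, `se2`, and `se3`; the feedback, knowledge, and personalization components are not modeled.

```python
def metasearch(query, engines, limit=10):
    """Send `query` to every engine and merge the ranked URL lists."""
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query)):
            # Assumed merge rule: a result scores higher the nearer it is
            # to the top of each engine's list.
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    merged = sorted(scores, key=lambda u: (-scores[u], u))
    return merged[:limit]


# Hypothetical stand-ins for SE 1, SE 2, and SE 3; each returns a ranked list.
def se1(query):
    return ["a", "b", "c"]


def se2(query):
    return ["b", "a"]


def se3(query):
    return ["b", "d"]


print(metasearch("search engines", [se1, se2, se3]))
```

A result that several engines rank highly ("b" here) rises to the top of the merged list, which is the main benefit a meta-search engine offers over any single source.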

Example of Indexes
Document 0 (http://example.com/herman): "Call me Ishmael."
Word numbers: call = 0, me = 1, ishmael = 2
Document table:
<docs>
  <doc id="0" href="http://example.com/herman" />
</docs>
Posting list:
<postings>
  <posting doc="0" word="call" />
  <posting doc="0" word="me" />
  <posting doc="0" word="ishmael" />
</postings>

Example of Indexes
Index:
<index>
  <word w="ishmael"> <posting doc="0"/> </word>
  <word w="call"> <posting doc="0"/> </word>
  <word w="me"> <posting doc="0"/> <posting doc="1"/> </word>
</index>
Search phrases (each posting also records the word number):
<word w="ishmael">
  <posting doc="0" wnum="2" />
</word>

Example of Indexes
Document 1 (http://example.com/marcel): "Longtemps, je me suis couché de bonne heure."
Word numbers: longtemps = 0, je = 1, me = 2, suis = 3, couché = 4, de = 5, bonne = 6, heure = 7
Document table:
<docs>
  <doc id="0" href="http://example.com/herman" />
  <doc id="1" href="http://example.com/marcel" />
</docs>
Posting list:
<postings>
  <posting doc="1" w="longtemps" />
  <posting doc="1" w="je" />
  <posting doc="1" w="me" />
  <posting doc="1" w="suis" />
  <posting doc="1" w="couché" />
  <posting doc="1" w="de" />
  <posting doc="1" w="bonne" />
  <posting doc="1" w="heure" />
</postings>

Example of Indexes
Index:
<index>
  <word w="ishmael"> <posting doc="0"/> </word>
  <word w="longtemps"> <posting doc="1"/> </word>
  <word w="bonne"> <posting doc="1"/> </word>
  <word w="call"> <posting doc="0"/> </word>
  <word w="couché"> <posting doc="1"/> </word>
  <word w="de"> <posting doc="1"/> </word>
  <word w="heure"> <posting doc="1"/> </word>
  <word w="je"> <posting doc="1"/> </word>
  <word w="me"> <posting doc="0"/> <posting doc="1"/> </word>
  <word w="suis"> <posting doc="1"/> </word>
</index>

Example of Indexes
<index>
  ...
  <word w="bonne">
    <posting doc="1" wnum="6" />
  </word>
  ...
  <word w="heure">
    <posting doc="1" wnum="7" />
  </word>
  ...
</index>
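The index slides above can be reproduced with a short sketch. It builds an inverted index whose postings carry the document id and word number, then uses the word numbers to answer a phrase query. The function names `tokenize`, `build_index`, and `phrase_search` are illustrative assumptions; the two documents are the ones from the slides.

```python
from collections import defaultdict

DOCS = {
    0: "Call me Ishmael.",
    1: "Longtemps, je me suis couché de bonne heure.",
}


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map each word to a list of (doc id, word number) postings."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for wnum, word in enumerate(tokenize(text)):
            index[word].append((doc_id, wnum))
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the words of `phrase` in sequence."""
    words = tokenize(phrase)
    if not words:
        return []
    hits = set()
    for doc_id, start in index.get(words[0], []):
        # Every following word must appear at the next word number.
        if all((doc_id, start + i) in index.get(w, [])
               for i, w in enumerate(words)):
            hits.add(doc_id)
    return sorted(hits)


index = build_index(DOCS)
print(index["me"])                           # postings for a word in both docs
print(phrase_search(index, "bonne heure"))   # phrase found via word numbers
```

Storing the word number in each posting is what makes phrase search possible: "bonne" at word 6 followed by "heure" at word 7 in the same document is a match, while the same two words far apart would not be.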

Better Rank
On-the-page factors
  Location (title, headline, near the top of the web page)
  Frequency (how often keywords appear)
Off-the-page factors
  Link analysis (how pages link to each other)
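To make the location and frequency factors concrete, here is a toy scoring function. Real engines do not publish their formulas, so the equal weighting, the `top_n` cutoff, and the name `on_page_score` are all assumptions chosen only to illustrate the idea.

```python
def on_page_score(term, title, body, top_n=20):
    """Toy score: frequency in the body plus bonuses for prominent locations."""
    term = term.lower()
    body_words = body.lower().split()
    # Frequency: the share of body words that are the search term.
    frequency = body_words.count(term) / max(len(body_words), 1)
    # Location: bonus when the term is in the title or near the top of the page.
    in_title = 1.0 if term in title.lower().split() else 0.0
    near_top = 1.0 if term in body_words[:top_n] else 0.0
    return frequency + in_title + near_top


print(on_page_score("crawler", "Crawler basics", "a crawler follows links"))
print(on_page_score("crawler", "Home", "welcome to my page"))
```

The first page scores higher because the term appears in the title, near the top, and in a quarter of the body words; the second page never mentions it.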

Better Rank
Emphasize specific, well-known phrases that relate to or describe your work.
Always include an alt attribute with every image.
Always include a title in EVERY document.
Include your best, most descriptive content at the top of your web page.
Use your ranked terms often, but in a natural way.
Splash pages using technologies such as Flash are empty as far as search engines are concerned.
If you have to redirect, create a web page that redirects users to the new pages.
Use both singular and plural forms of words.
Use synonyms.

Better Rank
Good title, good meta tags, good text.
Keep all pages within a small number of clicks (2 or 3) from your top page.
Expect search engines to max out at around 500 pages from any particular site. Subdivide large sites logically into subdomains, for example science.gold.ac.uk instead of gold.ac.uk/science/.
Dynamic delivery systems that use ? symbols in the URL string prevent search engines from getting to your pages.

Image Search
Title of the page
Image file name, path, host
Image alt attribute
Position in the document

Artificial links designed to boost ranking (Spam)
Spamming refers to any dishonest or misleading technique used to get better positioning in search engines:
  Hidden text
  Excessive repetition of keywords
Search engines have different rules on spamming, and may exclude sites that do it.

HTML Tags
<TITLE>: caption in the browser, page name in favorites, shown in search results by some search engines.
<META ...> names: name="description", name="keywords", name="robots"
<META name="keywords" content="some keywords">
<META name="description" content="a description of the page's contents">
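A short sketch shows how an indexer might read these tags from a page, using Python's standard `html.parser` module. The class name `HeadReader` and the sample `PAGE` are assumptions for illustration; the sketch does not claim to match how any specific engine parses pages.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collects the <title> text and named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page using the tags listed on the slide.
PAGE = """<html><head>
<TITLE>Search Engine Technology</TITLE>
<META name="keywords" content="search engines, crawlers, indexing">
<META name="description" content="Slides on how search engines work">
<META name="robots" content="index, nofollow">
</head><body>...</body></html>"""

reader = HeadReader()
reader.feed(PAGE)
print(reader.title)
print(reader.meta)
```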

Securing Your Web Site from SEs
Robot exclusion
  robots.txt file
  Meta tags
Comment tags: <!--stopindex--> and <!--startindex-->

Robot Exclusion: Meta Tags
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
Index the document, but don't follow links.
<!--stopindex-->
This page last updated on 6/10/03.
<!--startindex-->

Robot Exclusion: Robots.txt
Allow TAMU, disallow others:
    User-agent: TAMU-Ultraseek
    Disallow:

    User-agent: *
    Disallow: /
Exclude directories:
    User-agent: *
    Disallow: /tmp/
    Disallow: /cgi-bin/
    Disallow: /private/my.html
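The effect of the two robots.txt examples above can be checked with Python's standard `urllib.robotparser` module. The helper name `allowed`, the user agent `OtherBot`, and the example.com URLs are assumptions for illustration; the rule sets are the ones from the slide.

```python
from urllib.robotparser import RobotFileParser

ALLOW_TAMU_ONLY = """\
User-agent: TAMU-Ultraseek
Disallow:

User-agent: *
Disallow: /
"""

EXCLUDE_DIRECTORIES = """\
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /private/my.html
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An empty Disallow line means nothing is off limits for that agent.
print(allowed(ALLOW_TAMU_ONLY, "TAMU-Ultraseek", "http://example.com/page.html"))
print(allowed(ALLOW_TAMU_ONLY, "OtherBot", "http://example.com/page.html"))
print(allowed(EXCLUDE_DIRECTORIES, "OtherBot", "http://example.com/tmp/x.html"))
print(allowed(EXCLUDE_DIRECTORIES, "OtherBot", "http://example.com/index.html"))
```

Note that robots.txt is advisory: it tells well-behaved spiders what to skip, but it does not technically block access to the listed paths.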

How to Spot Spiders
Hits from robots or spiders show up in web logs.
Look in system logs for clients that accessed robots.txt.
Look for known spider IP addresses.
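One way to apply the robots.txt hint is to scan the access log for clients that requested that file. The sketch below assumes the log uses the Common Log Format; the regular expression, the function name `robots_txt_clients`, and the sample log lines (with documentation-range IP addresses) are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed Common Log Format: ip ident user [time] "METHOD path protocol" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)


def robots_txt_clients(lines):
    """Count requests for /robots.txt per client IP address."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").split("?")[0] == "/robots.txt":
            hits[match.group("ip")] += 1
    return hits


# Hypothetical log excerpt.
LOG = [
    '192.0.2.10 - - [10/Jun/2003:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68',
    '198.51.100.7 - - [10/Jun/2003:10:00:05 +0000] "GET /index.html HTTP/1.0" 200 5120',
    '192.0.2.10 - - [10/Jun/2003:10:00:09 +0000] "GET /index.html HTTP/1.0" 200 5120',
]

print(robots_txt_clients(LOG))
```

Ordinary browsers rarely ask for robots.txt, so an address that does is a likely spider and its other requests in the log can be attributed accordingly.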

Understand Limitations of Search Engines
Spiders stumble on complex content:
  Frames often confuse search engine spiders.
  Spiders tend to ignore JavaScript-based navigation.
  Spiders can get stuck inside dynamic (e.g. database-driven) websites.
Search spiders or crawlers do *not* crawl in real time.
Lag times getting info to the index vary by search engine.
If a website is not submitted to the search engine, it won't be crawled.
Not every page from a website is crawled.
A webmaster can choose to not have a page crawled.
Formats like PDF, Flash, Zip files, executable programs, and others cannot be searched.
The Invisible Web

Learning Search Technology
This learning search technology logs every search and every click on search results that visitors to your site make.
This data is processed daily to improve the search results, so the system learns from your visitors every day.
Use the intelligence of your visitors to improve your search results.
Sometimes a user returns to the search page and clicks on a different result. During processing, this type of behavior is detected and the importance of that first, unfruitful click is reduced.
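The idea of discounting an unfruitful first click can be sketched as follows. The slide gives no formula, so the `abandoned_weight` value, the session format, and the name `click_scores` are assumptions made purely for illustration.

```python
from collections import defaultdict


def click_scores(sessions, abandoned_weight=0.25):
    """Score (query, url) pairs from click logs.

    Each session is (query, [urls clicked in order]). The last click in a
    session counts fully; earlier clicks, which the visitor came back from,
    count only `abandoned_weight` (an assumed value).
    """
    scores = defaultdict(float)
    for query, clicks in sessions:
        for i, url in enumerate(clicks):
            is_last = i == len(clicks) - 1
            scores[(query, url)] += 1.0 if is_last else abandoned_weight
    return dict(scores)


# Hypothetical click log: the first visitor tried /map, then came back
# to the results and clicked /permits instead.
SESSIONS = [
    ("parking", ["/map", "/permits"]),
    ("parking", ["/permits"]),
    ("parking", ["/map"]),
]

print(click_scores(SESSIONS))
```

Both pages were clicked twice, but `/permits` ends up with the higher score because one of the `/map` clicks was followed by a return to the results page.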

Learning Search

Learning Search Works

Comprehensive Reporting Tools
[Chart: category activity]

Google at a glance
Google currently ranks and searches three billion Web pages.
To speed up PageRank, Stanford researchers have developed a trio of techniques based on a branch of mathematics called numerical linear algebra.
The researchers make use of their discovery that on most sites, up to 80 percent of links point to other pages on the same site, so each site looks like a thick block of links.
The third method, called Adaptive PageRank, relies on the fact that lower-ranking sites tend to be computed faster than higher-ranking ones.
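For context, the computation those techniques accelerate can be sketched as plain power iteration. This is the basic textbook form of PageRank, not any of the Stanford speed-ups mentioned above; the damping factor of 0.85, the iteration count, and the three-page `LINKS` graph are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic power-iteration PageRank over {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a and b both link to c, and c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c ranks highest because two pages link to it, a comes next because the highly ranked c links to it, and b, with no incoming links, keeps only the baseline share.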

Searching is the first thing people use on the web now, and there are fewer and fewer alternatives.
Thank you!