Using the Penn State Search Engine

Jeffrey D'Angelo and James Leous
root@aset.psu.edu
http://aset.its.psu.edu/
ITS Academic Services and Emerging Technologies
ITS Training root@aset.psu.edu p.1

How Does Search Work?
- In general, search engines deploy a piece of software called a crawler, which begins at a given URL or set of URLs and follows the links on each page to other pages.
- The crawl may have boundaries (for example, "only sites in this domain" or "only pages N levels deep in the current site").
- The crawler collects these pages and passes all or part of each one to a catalog, or index, of the site(s).
- The index is passed to the actual search engine software, where pages are ranked according to a numerical algorithm. Google's numerical algorithm is called PageRank.
- The usefulness of a crawler-based search engine depends on all three of these parts: the crawler, the index, and the ranking software.
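The crawl described above can be sketched in a few lines of Python. This is only an illustration, not the appliance's actual crawler: the PAGES dictionary is a made-up in-memory "web" standing in for real HTTP fetches, and the function name crawl and its parameters are hypothetical. It shows the two boundaries mentioned on the slide (stay in one domain, stop N levels deep).

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch each page over HTTP and parse out its links.
PAGES = {
    "http://www.psu.edu/": ["http://www.psu.edu/a", "http://example.com/"],
    "http://www.psu.edu/a": ["http://www.psu.edu/b", "http://www.psu.edu/"],
    "http://www.psu.edu/b": ["http://www.psu.edu/c"],
    "http://www.psu.edu/c": [],
    "http://example.com/": ["http://example.com/x"],
}

def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

def crawl(seeds, domain, max_depth):
    """Breadth-first crawl bounded by domain and by link depth."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append(url)  # a real system would hand the page to the indexer
        if depth == max_depth:
            continue  # boundary: do not follow links any deeper
        for link in PAGES.get(url, []):
            if not in_domain(link, domain):
                continue  # boundary: only sites in this domain
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return collected

print(crawl(["http://www.psu.edu/"], "psu.edu", 2))
```

With a depth limit of 2, the page at /c is never reached, and the example.com link is skipped because it falls outside the domain boundary.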

By the Numbers
From the last crawl (Wednesday evening, October 27, 2004):
Time:
- 7 hr 34 min 20 sec to query the Web servers
- 2 hr 55 min 29 sec to build the index
- 50 min 49 sec to replicate and test the index
- Start to finish: 11 hr 20 min 38 sec to crawl and index Penn State

By the Numbers (cont'd)
Amount:
- 838,958 total URLs crawled and indexed
- 466,077 URLs found but excluded (hosts outside .psu.edu, URLs containing "?", etc.)
Queries:
- Total for September 2004: 582,441
- Average for September: 19,415 per day
- Roughly 1 query every 5 seconds
- About 1,200 in the time it takes to give this seminar
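As a quick check of the query figures, the arithmetic can be reproduced as follows. The only inputs are the September total from the slide and the fact that September has 30 days; the variable names are illustrative.

```python
total_queries = 582_441      # September 2004 total, from the slide
days_in_september = 30

per_day = total_queries / days_in_september
seconds_between_queries = 86_400 / per_day          # 86,400 seconds in a day
# Seminar length implied by "about 1,200 queries", in minutes
implied_seminar_minutes = 1_200 * seconds_between_queries / 60

print(round(per_day))                     # 19415
print(round(seconds_between_queries, 2))  # 4.45
print(round(implied_seminar_minutes))     # 89
```

So "1 every 5 seconds" is a rounding of about 4.45 seconds, and 1,200 queries corresponds to a seminar of roughly an hour and a half.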

Why Search?
- Navigational: the immediate intent is to reach a particular site.
- Informational: the intent is to acquire some information assumed to be present on one or more pages.
- Transactional: the intent is to perform some Web-mediated activity.

Does the Google Engine Scale?
Look at the numbers:
- www.google.com: 4.3 billion pages
- search.psu.edu: 0.83 million pages
Look at the connections:
- www.google.com relies heavily on links from important pages to other important pages.
- search.psu.edu: with the exception of some central pages, very few departmental pages link across academic or administrative boundaries.
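To see why link structure matters, here is a minimal sketch of the simplified, publicly described PageRank iteration. It is not Google's production ranking and not necessarily what the appliance behind search.psu.edu does; the damping value 0.85, the function name, and the tiny link graph (one central page, three departmental pages that link only to it) are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical campus-like graph: departments link to the central page only.
campus = {
    "central": ["dept_a", "dept_b", "dept_c"],
    "dept_a": ["central"],
    "dept_b": ["central"],
    "dept_c": ["central"],
}
ranks = pagerank(campus)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the central page collects most of the rank, while the departmental pages, which never link to each other, are indistinguishable from one another. That is the situation the slide describes for a site with few cross-departmental links.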

What Really Matters?
Although the full details of Google's ranking algorithm are secret, the following is known:
- Links from other pages are heavily weighted.
- <TITLE> content is very important.
- Keyword density within a page is important.
- Keyword-laden link text counts for more than generic text such as "click here".
- Meta keyword tags are not that important.

Are Tags Irrelevant?
Under our current system:
- Meta keywords are not heavily weighted.
- The Penn State sample differs from the Google sample with regard to interconnections.
- Use keywords in the <TITLE> tag of each page, and use keyword-laden links to other pages both within and outside your site.
- Remember that Google is the current state of the art; the future is probably a more federated search mechanism.
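One way to act on this advice is to check your own pages for a meaningful <TITLE> and for generic link text. The sketch below uses only Python's standard html.parser module. The class name, the list of "generic" phrases, and the sample page are all assumptions for illustration; it says nothing about how the search engine itself scores pages.

```python
from html.parser import HTMLParser

# Assumed list of link texts that carry no keywords.
GENERIC_LINK_TEXT = {"click here", "here", "more", "link"}

class PageAudit(HTMLParser):
    """Collects the page title and the visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.link_texts = []
        self._in_title = False
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self._buffer = []
        elif tag == "a":
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_title or self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title" and self._in_title:
            self.title = text
            self._in_title = False
        elif tag == "a" and self._in_link:
            self.link_texts.append(text)
            self._in_link = False

def audit(html):
    """Return (title, list of link texts that are generic)."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    generic = [t for t in parser.link_texts if t.lower() in GENERIC_LINK_TEXT]
    return parser.title, generic

SAMPLE_PAGE = """<html><head><title>Penn State Admissions</title></head>
<body><a href="/apply">Apply for undergraduate admission</a>
To see deadlines, <a href="/dates">click here</a>.</body></html>"""

print(audit(SAMPLE_PAGE))
```

For the sample page, the audit reports the title and flags the one "click here" link, which could be rewritten as something like "admission deadlines".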

Integrating the Penn State Search Engine Into Your Site
- Invoking
- Restricting Search Results
- Customizing Search Results Style (new feature)

Invoking the Search Engine from Your Site
- An easy and convenient way for visitors to search from your site
- Directions: <http://aset.its.psu.edu/googledocs/instructions.html#invoke>

Restricting Search Results
- as_sitesearch: restricts results to URLs that begin with the value you set for this parameter. Adds "site:<your URL>" to the search query.
- sitesearch: restricts results to URLs that begin with the value you set for this parameter. Does not add "site:<your URL>" to the search query.
- restrict: uses a subcollection, which administrators must set up for you before a crawl. Allows multiple sites to be grouped together; takes a few days to take effect.
- Directions: <http://aset.its.psu.edu/googledocs/instructions.html#restrict>
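A restricted search is ultimately just a URL with one of these parameters attached. The sketch below builds such URLs with Python's standard library. The parameter names as_sitesearch, sitesearch, and restrict come from the slide; the "/search" path, the "q" parameter for the search terms, the example site prefix, and the subcollection name are assumptions, so check the directions page for the exact form.

```python
from urllib.parse import urlencode

# Host is from the slides; the "/search" path is an assumption.
SEARCH_URL = "http://search.psu.edu/search"

def build_query(terms, **params):
    """Build a search URL; extra keyword arguments become query parameters."""
    query = {"q": terms}  # "q" for the search terms is assumed
    query.update(params)
    return SEARCH_URL + "?" + urlencode(query)

# Hypothetical site prefix and subcollection name, for illustration only.
print(build_query("parking permit", as_sitesearch="www.example.psu.edu/dept"))
print(build_query("parking permit", sitesearch="www.example.psu.edu/dept"))
print(build_query("parking permit", restrict="my_subcollection"))
```

urlencode takes care of escaping, so spaces in the search terms and slashes in the site prefix arrive intact at the search engine.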

Customize Search Results Format/Style
Two choices for an alternate format: XML and XSLT.
- XML: results are returned as XML; good for dynamic applications.
- XSLT: translates the XML results into HTML in your desired format; good for Web visitors.

Using XML Results
- XML is the Extensible Markup Language, a data format that can be easily passed between programs and systems.
- You can have a dynamic application query the search engine directly.
- This application can do offline processing.
- This application can serve as an intermediary between the Web visitor and the search engine, which lets you provide your own formatting controls.
- Downside: it requires some programming work.
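The "programming work" is mostly a matter of parsing the XML and picking out the fields you want. The sketch below parses an inline sample rather than a live response. The element names (GSP, RES, R, U, T, S) and the sample data are assumptions made for illustration; consult the directions for the schema the appliance actually returns.

```python
import xml.etree.ElementTree as ET

# Made-up sample response; element names are assumed, not confirmed.
SAMPLE_XML = """<GSP>
  <RES>
    <R N="1"><U>http://www.psu.edu/</U><T>Penn State</T><S>Home page</S></R>
    <R N="2"><U>http://its.psu.edu/</U><T>ITS</T><S>Technology services</S></R>
  </RES>
</GSP>"""

def parse_results(xml_text):
    """Turn each result element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return [
        {
            "rank": int(result.get("N")),
            "url": result.findtext("U"),
            "title": result.findtext("T"),
            "snippet": result.findtext("S"),
        }
        for result in root.iter("R")
    ]

for hit in parse_results(SAMPLE_XML):
    print(hit["rank"], hit["title"], hit["url"])
```

Once the results are ordinary dictionaries, an intermediary application can filter, reorder, or render them in whatever format the site needs.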

Using XML Results (cont'd)
Directions on receiving XML: <http://aset.its.psu.edu/googledocs/instructions.html#xml>

Customizing HTML Results
- All search results begin in XML format. The Google Search Appliance translates them into another format, such as HTML.
- It uses Extensible Stylesheet Language Transformations (XSLT) documents to translate the results into a particular format or style.
- Use the proxystylesheet parameter to specify which XSLT you wish to use.
- proxystylesheet=pennstate means "use the XSLT for the collection called PennState", which is the master index for our appliance.
- proxystylesheet=<a URL> means "attempt to read an XSLT file at the address provided".
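The two forms of proxystylesheet differ only in the value passed. A small sketch follows, again assuming a "/search" path and a "q" parameter that the slides do not confirm, and using a hypothetical stylesheet address. The point it illustrates is that a stylesheet URL must be percent-encoded when it is embedded in the query string.

```python
from urllib.parse import parse_qs, urlencode, urlparse

def styled_search_url(terms, stylesheet):
    """Build a search URL that asks for a specific XSLT stylesheet.

    `stylesheet` is either a collection name (such as "pennstate") or the
    URL of an XSLT file. The path and the "q" parameter are assumptions.
    """
    query = urlencode({"q": terms, "proxystylesheet": stylesheet})
    return "http://search.psu.edu/search?" + query

named = styled_search_url("library hours", "pennstate")
# Hypothetical location of a department's own stylesheet.
custom = styled_search_url("library hours", "http://www.example.psu.edu/search.xslt")

print(named)
print(custom)
# Decoding the query string recovers the original stylesheet address.
print(parse_qs(urlparse(custom).query)["proxystylesheet"][0])
```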

Customizing HTML Results (cont'd)
Directions: <http://aset.its.psu.edu/googledocs/instructions.html#format>

Self-Serve XSLT (new)
- The new self-serve XSLT file generator form.
- Basic knowledge of HTML is required; knowledge of XSLT is not.
- You may design your site style in any program you wish (Dreamweaver, etc.) and simply copy and paste the HTML into the Web form.
- Generator URL with directions: <http://aset.its.psu.edu/cgi-bin/googledocs_custom_xslt.cgi>

Final Notes
- Google Search Appliance Backup Strategy Proposal: <http://www.personal.psu.edu/staff/j/c/jcd/useless/google_backup_strategy/>
- Questions?