Search Engine Technology
Mansooreh Jalalyazdi
Search Engines
Search engines are programs that people use to find information by typing in keywords. The search engine returns a list of sites it considers relevant, based on the particular way it indexes millions of sites.
Some Search Engines
http://www.google.com
http://www.altavista.com
http://www.askjeeves.com
http://www.alltheweb.com
http://www.inktomi.com
http://www.lycos.com
http://www.yahoo.com
http://www.msn.com
http://www.aol.com
http://www.netscape.com
http://www.excite.com
http://www.infoseek.com
http://www.looksmart.com
http://www.firstgov.com
http://www.wisenut.com
http://www.hotbot.com
http://www.go.com
http://www.mamma.com
http://www.northernlight.com
http://www.dmoz.org
http://www.snap.com
http://www.overture.com
http://www.webcrawler.com
http://www.metacrawler.com
http://www.metaseek.com
http://www.dogpile.com
http://www.ixquick.com
http://www.vivisimo.com
http://www.sherlockhound.com
http://www.inquirus.com
Mooter SE
Ranking Importance
Search Engine Types
Crawler-based (active search engine): Google, AlltheWeb
Human-powered directories (passive search engine): Yahoo, LookSmart
Meta-crawlers (meta-search engine): AskJeeves
Most search engines pull results from some or all of these sources, but one type normally dominates.
Some Other Search Engine Types
News search engines
Multimedia search engines
Metacrawlers
Kids search engines
Regional search engines
Parts of a Crawler-Based Search Engine
Spider (crawler)
Index (catalog)
Search engine software
  Sifts through the index
  Ranks the matches
Web Spiders Are Simplistic
They start with a given URL and grab that page.
They find all the links in the page.
They follow each of those links, retrieving a document.
They repeat this process until something tells them to stop.
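The spider loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine is built: the `fetch` callable, the `max_pages` stop condition, and the toy `PAGES` dictionary (which stands in for the network) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None (hypothetical)."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    # "Something tells them to stop": here, an empty queue or a page budget.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
}

visited = crawl("http://example.com/", PAGES.get)
```

The `seen` set keeps the spider from retrieving the same page twice when pages link back to each other.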
What Spiders (Robots) Do
Architecture of a Meta-Search Engine
[Figure: User, User Interface, Query Dispatcher, search engines SE 1, SE 2 and SE 3 on the Web, and Display, with Feedback, Knowledge and Personalize components]
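One way to picture the query dispatcher and display steps is the Python sketch below. The figure does not say how results are merged, so the merging rule here (each engine contributes 1/(rank+1) to a URL's score) is an assumption chosen for simplicity, and the three engine functions are hypothetical stand-ins that return fixed result lists.

```python
def meta_search(query, engines, limit=10):
    """Send the query to every engine and merge the ranked result lists."""
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query)):
            # Assumed merging rule: higher positions contribute more.
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    # Best score first; ties broken alphabetically for a stable display.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:limit]


# Hypothetical engines standing in for SE 1, SE 2 and SE 3.
def se1(query):
    return ["http://example.com/a", "http://example.com/b", "http://example.com/c"]


def se2(query):
    return ["http://example.com/b", "http://example.com/a", "http://example.com/d"]


def se3(query):
    return ["http://example.com/b", "http://example.com/e"]


merged = meta_search("search engines", [se1, se2, se3])
```

Pages returned by several engines rise to the top, which is the main benefit a meta-search engine offers over any single source.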
Example of Indexes
http://example.com/herman
Call me Ishmael.
Doc # 0, Word # 0 1 2
Document table:
<docs>
  <doc id="0" href="http://example.com/herman" />
</docs>
Posting list:
<postings>
  <posting doc="0" word="call" />
  <posting doc="0" word="me" />
  <posting doc="0" word="ishmael" />
</postings>
Example of Indexes
Index:
<index>
  <word w="ishmael"> <posting doc="0"/> </word>
  <word w="call"> <posting doc="0"/> </word>
  <word w="me"> <posting doc="0"/> <posting doc="1"/> </word>
</index>
Search phrases (each posting also records the word number):
<word w="ishmael"> <posting doc="0" wnum="2" /> </word>
Example of Indexes
Longtemps, je me suis couché de bonne heure.
Doc # 1, Word # 0 1 2 3 4 5 6 7
Document table:
<docs>
  <doc id="0" href="http://example.com/herman" />
  <doc id="1" href="http://example.com/marcel" />
</docs>
Posting list:
<postings>
  <posting doc="1" w="longtemps" />
  <posting doc="1" w="je" />
  <posting doc="1" w="me" />
  <posting doc="1" w="suis" />
  <posting doc="1" w="couché" />
  <posting doc="1" w="de" />
  <posting doc="1" w="bonne" />
  <posting doc="1" w="heure" />
</postings>
Example of Indexes
Index:
<index>
  <word w="ishmael"> <posting doc="0"/> </word>
  <word w="longtemps"> <posting doc="1"/> </word>
  <word w="bonne"> <posting doc="1"/> </word>
  <word w="call"> <posting doc="0"/> </word>
  <word w="couché"> <posting doc="1"/> </word>
  <word w="de"> <posting doc="1"/> </word>
  <word w="heure"> <posting doc="1"/> </word>
  <word w="je"> <posting doc="1"/> </word>
  <word w="me"> <posting doc="0"/> <posting doc="1"/> </word>
  <word w="suis"> <posting doc="1"/> </word>
</index>
Example of Indexes
<index>
  ...
  <word w="bonne"> <posting doc="1" wnum="6" /> </word>
  ...
  <word w="heure"> <posting doc="1" wnum="7" /> </word>
  ...
</index>
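The index slides above can be mirrored in Python. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, tokenizing is a simple lowercase word split, and a dictionary stands in for the XML structures. Each posting keeps the word number (`wnum`), which is what makes phrase search possible.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each word to {doc number: [word numbers]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for wnum, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(wnum)
    return index


def phrase_search(index, phrase):
    """Return the docs where the phrase's words appear at consecutive positions."""
    words = tokenize(phrase)
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index[w].get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = ["Call me Ishmael.", "Longtemps, je me suis couché de bonne heure."]
index = build_index(docs)
```

Here `index["me"]` has postings for both documents, while `phrase_search(index, "bonne heure")` matches only document 1 because "bonne" and "heure" occupy word numbers 6 and 7.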
Better Rank
On-the-page factors
  Location (title, headline, near the top of the web page)
  Frequency (how often keywords appear)
Off-the-page factors
  Link analysis (how pages link to each other)
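To make the location and frequency factors concrete, here is a toy scoring function in Python. The weights, the "first 20 words" cut-off, and the function name are invented for illustration; real search engines do not publish their formulas, and link analysis is left out entirely.

```python
import re


def on_page_score(query, title, body,
                  title_weight=3.0, top_weight=2.0, top_words=20):
    """Toy score: keyword hits count more in the title and near the top."""
    terms = set(re.findall(r"\w+", query.lower()))
    title_words = re.findall(r"\w+", title.lower())
    body_words = re.findall(r"\w+", body.lower())

    # Location: every keyword in the title gets the (assumed) title weight.
    score = title_weight * sum(1 for word in title_words if word in terms)

    # Frequency: every keyword in the body counts, with a bonus near the top.
    for position, word in enumerate(body_words):
        if word in terms:
            score += top_weight if position < top_words else 1.0
    return score


focused = on_page_score("search engine",
                        "Search Engine Basics",
                        "A search engine indexes pages.")
unfocused = on_page_score("search engine",
                          "Welcome",
                          "This page is about our company.")
```

With these assumed weights, the page that uses the keywords in its title and opening text scores higher than the page that never mentions them.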
Better Rank
Emphasize specific, well-known phrases that relate to or describe your work.
Always include an alt attribute with every image.
Always include a title in EVERY document.
Include your best, most descriptive content at the top of your web page.
Use your ranked terms often, but in a natural way.
Splash pages using technologies such as Flash are empty as far as search engines are concerned.
If you have to redirect, create a web page that redirects users to the new pages.
Use both singular and plural forms of words.
Use synonyms.
Better Rank
Good title, good meta tags, good text.
Keep all pages within a small number of clicks (2 or 3) from your top page.
Expect search engines to max out at around 500 pages from any particular site; subdivide large sites logically into subdomains.
  Instead of gold.ac.uk/science/, use science.gold.ac.uk.
Dynamic delivery systems that use ? symbols in the URL string prevent search engines from getting to your pages.
Image Search
Title of the page
Image file name, path, host
Image alt attribute
Position in the document
Artificial links designed to boost ranking (Spam)
Spamming refers to any dishonest or misleading technique used to get better positioning in search engines:
  Hidden text
  Excessive repetition of keywords
Each search engine has its own rules on spamming, and may exclude sites that do it.
HTML Tags
<TITLE>: caption in the browser, page name in favorites, shown in search results by some SEs.
<META ...> with name="description", name="keywords", name="robots"
<META name="keywords" content="some keywords">
<META name="description" content="a description of the page's contents">
Securing Your Web Site from SEs
Robot exclusion
  robots.txt file
  Meta tags
Comment tags: <!--stopindex--> and <!--startindex-->
Robot Exclusion: Meta Tags
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
  Index the document, but don't follow links.
<!--stopindex--> This page last updated on 6/10/03. <!--startindex-->
Robot Exclusion: robots.txt
Allow TAMU, disallow others:
  User-agent: TAMU-Ultraseek
  Disallow:

  User-agent: *
  Disallow: /
Exclude directories:
  User-agent: *
  Disallow: /tmp/
  Disallow: /cgi-bin/
  Disallow: /private/my.html
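The effect of these two robots.txt files can be checked with Python's standard `urllib.robotparser` module. This is a small sketch: the helper name `allowed`, the agent name `OtherBot`, and the example.com URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

ALLOW_TAMU_ONLY = """\
User-agent: TAMU-Ultraseek
Disallow:

User-agent: *
Disallow: /
"""

EXCLUDE_DIRECTORIES = """\
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /private/my.html
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An empty Disallow line means "nothing is disallowed" for that agent.
tamu_ok = allowed(ALLOW_TAMU_ONLY, "TAMU-Ultraseek", "http://example.com/index.html")
# "OtherBot" is a hypothetical spider that falls under the "*" record.
other_ok = allowed(ALLOW_TAMU_ONLY, "OtherBot", "http://example.com/index.html")
```

Note that robots.txt is a request to well-behaved spiders, not an access control mechanism; a robot that ignores the file can still retrieve the pages.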
How to Spot Spiders
Hits from robots or spiders show up in web logs.
Look in system logs for clients that accessed robots.txt.
Look for known spider IP addresses.
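As a rough illustration of the robots.txt check, the Python sketch below scans web log lines and collects the client addresses that requested /robots.txt. It assumes logs in the widely used Common Log Format; the sample lines and addresses are made up, and a real log may need a different pattern.

```python
import re

# Assumed Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def robots_txt_clients(log_lines):
    """Return the set of client addresses that requested /robots.txt."""
    clients = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            clients.add(match.group("ip"))
    return clients


# Made-up sample log entries.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Jun/2003:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68',
    '198.51.100.7 - - [10/Jun/2003:10:00:05 +0000] "GET /index.html HTTP/1.0" 200 2048',
    '192.0.2.10 - - [10/Jun/2003:10:00:09 +0000] "GET /index.html HTTP/1.0" 200 2048',
]

likely_spiders = robots_txt_clients(SAMPLE_LOG)
```

The resulting set can then be compared against a list of known spider IP addresses, since ordinary browsers rarely ask for robots.txt.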
Understand the Limitations of Search Engines
Spiders stumble on complex content:
  Frames often confuse search engine spiders.
  Spiders tend to ignore JavaScript-based navigation.
  Spiders can get stuck inside dynamic (e.g. database-driven) websites.
Search spiders or crawlers do *not* crawl in real time.
  Lag times getting info to the index vary by search engine.
If a website is not submitted to the search engine and nothing links to it, it won't be crawled.
Not every page from a website is crawled.
A webmaster can choose not to have a page crawled.
Formats like PDF, Flash, Zip files, executable programs, and others often cannot be searched.
The Invisible Web
Learning Search Technology
This learning search technology logs every search, and every click on a search result, that visitors to your site make. This data is processed daily to improve the search results, which means we are learning from your visitors every day. Use the intelligence of your visitors to improve your search results.
Sometimes a user will return to the search page and click on a different result. During processing we detect this behavior and reduce the importance of that first, unfruitful click.
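The click-processing idea can be sketched in Python as follows. The slide gives no details of the actual processing, so everything here is an assumption for illustration: a session is a query plus the results clicked in order, a click that is followed by a click on a different result counts as unfruitful, and unfruitful clicks get half weight.

```python
from collections import defaultdict


def click_scores(sessions, unfruitful_weight=0.5):
    """Sum click weights per query and URL, discounting abandoned clicks."""
    scores = defaultdict(lambda: defaultdict(float))
    for query, clicks in sessions:
        for i, url in enumerate(clicks):
            # The visitor came back and clicked something else afterwards.
            followed_by_other = any(later != url for later in clicks[i + 1:])
            scores[query][url] += unfruitful_weight if followed_by_other else 1.0
    return scores


def rerank(results, query_scores):
    """Order results by learned click score; unclicked results keep their order."""
    return sorted(results, key=lambda url: -query_scores.get(url, 0.0))


# Hypothetical daily log: (query, results clicked in order).
sessions = [
    ("robots", ["/a.html", "/b.html"]),  # /a.html was clicked first, then abandoned
    ("robots", ["/b.html"]),
    ("robots", ["/a.html"]),
]
scores = click_scores(sessions)
reranked = rerank(["/a.html", "/b.html", "/c.html"], scores["robots"])
```

In this toy log /b.html ends up with a score of 2.0 against 1.5 for /a.html, so it moves ahead in the results for the query "robots".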
Learning Search
Learning Search Works
Comprehensive Reporting Tools
[Figure: category activity chart]
Google at a Glance
Google currently ranks and searches three billion Web pages.
To speed up PageRank, Stanford researchers have developed a trio of techniques based on a branch of mathematics called numerical linear algebra.
The researchers make use of their discovery that on most sites, up to 80 percent of links point to other pages on the same site, so each site looks like a thick block of links.
The third method, called Adaptive PageRank, relies on the fact that lower-ranking sites tend to be computed faster than higher-ranking ones.
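For context, the basic PageRank computation that these techniques aim to speed up can be sketched as a simple repeated update in Python. This is the plain textbook iteration, not the Stanford speed-up methods; the damping value 0.85, the iteration count, and the three-page link graph are assumptions, and every linked page is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Plain power iteration; links maps each page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Tiny made-up link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In this graph page c collects links from both a and b and ends up with the highest rank. Each pass touches every link, which is why a computation over billions of pages benefits from faster methods.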
Searching is the first thing people use on the web now, and there are fewer and fewer alternatives.
Thank you!