Internet Search (COSC 488)
Nazli Goharian, 2005


Outline

Web: Indexing & Efficiency
  - Partitioned Indexing
  - Index Tiering & other early termination techniques
  - Index in Dynamic Environment
Improving effectiveness of Web search engines
  - Web page ranking: query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
  - Result snippets
Social Search
  - Tagging, collaborative search/filtering, recommender system
Real-time search
Peer-to-Peer Search

The Web

Document collections are scattered across many geographical areas. Constraints prohibiting the centralization of data include:
  - Data security
  - Volume
  - Rate of change
  - Political and legal constraints
  - Other proprietary motivations

Web Search

  - Parallel and distributed processing
  - Web search tools access data distributed on servers worldwide but index it centrally.
  - Most of these systems have a partitioned index on large clusters of servers with centralized control.
  - They store pointers, in the form of hypertext links, to various Web servers.

Partitioned Indexing

The index is partitioned across multiple machines, based on either:
  - Terms (global index organization)
      - Each node holds the posting lists for some terms
      - Using a content index, query terms are sent to the nodes holding those terms
      - Higher concurrency level, but larger posting lists
  - Documents (local index organization), more common
      - Each node holds a complete index for its own subset of documents (shorter posting lists)
      - Query terms are sent to all nodes
      - The top k results from each node are merged
      - Global statistics (e.g., idf) must be calculated
  - A hybrid approach in tiered indexing may be used

Index Tiering

  - A popular early termination technique to improve the efficiency of query processing
  - Nodes are divided into two tiers: the index of the most popular documents is allocated to tier 1 and the rest to tier 2
  - Search tier 1 first; if there are not enough results, search tier 2
  - Note: other popular early termination techniques (top-doc and query pruning) were discussed earlier in the semester.
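The document-partitioned query flow and the two-tier fallback can be sketched as follows. This is a minimal illustration, not a production design: each node is assumed to be a plain dictionary mapping a term to `(doc_id, weight)` pairs, documents are assumed not to repeat across nodes or tiers, and the function names (`search_node`, `search_partitioned`, `search_tiered`) are hypothetical.

```python
import heapq

def search_node(node_index, query_terms, k):
    """Score documents on one node by summing per-term weights; return its top k."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in node_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

def search_partitioned(nodes, query_terms, k):
    """Send the query to every node and merge the per-node top-k lists."""
    candidates = []
    for node_index in nodes:
        candidates.extend(search_node(node_index, query_terms, k))
    return heapq.nlargest(k, candidates, key=lambda item: item[1])

def search_tiered(tier1_nodes, tier2_nodes, query_terms, k):
    """Search tier 1 first; fall back to tier 2 only if fewer than k results."""
    results = search_partitioned(tier1_nodes, query_terms, k)
    if len(results) < k:
        more = search_partitioned(tier2_nodes, query_terms, k)
        results = heapq.nlargest(k, results + more, key=lambda item: item[1])
    return results

tier1 = [{"web": [("d1", 2.0)]},
         {"web": [("d2", 1.0)], "search": [("d2", 0.5)]}]
tier2 = [{"web": [("d3", 3.0)]}]
print(search_tiered(tier1, tier2, ["web"], 3))
```

When tier 1 already yields k results, tier 2 is never consulted, which is where the early-termination saving comes from. The trade-off is that a higher-scoring tier-2 document can be missed.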

Distributed Index Construction

  - Not possible on a single machine
  - Various architectures exist for distributed indexing
  - MapReduce architecture (a term-partitioned index)
      - A master node assigns tasks to worker nodes (map workers & reduce workers) to split up the computing jobs:
          - Map phase: parsing & building localized <term, doc> pairs
          - Reduce phase: combining/merging posting pairs for each term

MapReduce (Cont'd)

  - Map & reduce phases can be done in parallel on many machines
  - A map machine can be a reducer machine in the process
  - Data is broken into pieces (shards), generally 16 MB-64 MB [128 MB], which are sent to map workers as they finish their jobs
  - Map workers (generally) work on one shard at a time, unless they have more than one CPU; they parse and generate <term, doc> pairs (which can be combined into <term, doc, tf>)
  - Pairs are sorted by term, and then by a secondary key (doc_id)
  - The same keys (terms) are assigned to the same reduce worker
  - Load should be balanced across the reducers
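The map and reduce phases described above can be simulated in a single process to show the data flow. This is only a sketch under stated assumptions: a shard is a list of `(doc_id, text)` pairs, tokenization is whitespace splitting, and the byte-sum routing in `partition` is a stand-in for whatever partitioning function a real framework uses. All function names are hypothetical.

```python
from collections import defaultdict

def map_phase(shard):
    """Parse one shard of (doc_id, text) pairs and emit (term, doc_id, tf) triples."""
    triples = []
    for doc_id, text in shard:
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        triples.extend((term, doc_id, tf) for term, tf in counts.items())
    return triples

def partition(triples, num_reducers):
    """Route every triple for the same term to the same reducer (assumed routing rule)."""
    buckets = [[] for _ in range(num_reducers)]
    for triple in triples:
        buckets[sum(triple[0].encode()) % num_reducers].append(triple)
    return buckets

def reduce_phase(triples):
    """Sort by term, then doc_id, and merge the triples into posting lists."""
    postings = defaultdict(list)
    for term, doc_id, tf in sorted(triples):
        postings[term].append((doc_id, tf))
    return dict(postings)

def build_index(shards, num_reducers=2):
    """Run map over all shards, partition by term, then reduce each bucket."""
    all_triples = [t for shard in shards for t in map_phase(shard)]
    index = {}
    for bucket in partition(all_triples, num_reducers):
        index.update(reduce_phase(bucket))
    return index

shards = [[(1, "web search web")], [(2, "search engine")]]
index = build_index(shards)
print(index["search"])
```

Because each term lands on exactly one reducer, the reducers' outputs never overlap, and the result is a term-partitioned index.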

MapReduce (Cont'd)

[Figure taken from: C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press]

Query Servers

  - Each server has its own disk holding a portion of the index
  - Queries are distributed, via centralized control, to the servers that contain the related posting lists
  - Common terms may map to many servers
  - No single point of resource contention (efficient)
  - If a server crashes, that portion of the index is not available

Index in Dynamic Environment

The data collection is not static. Options:
  - Reconstruct the index periodically from scratch (many search engines use this)
  - Maintain an auxiliary index to store new documents & remerge it with the existing index
  - Maintain multiple indexes; this complicates maintaining collection statistics
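The auxiliary-index option can be illustrated with a small sketch. The class name `DynamicIndex`, the document-count merge threshold, and the use of simple doc-id posting lists are all assumptions made for illustration; the slides do not prescribe when or how the remerge happens.

```python
class DynamicIndex:
    """Main index plus a small auxiliary index that is remerged periodically."""

    def __init__(self, merge_threshold=2):
        self.main = {}            # term -> sorted list of doc ids
        self.auxiliary = {}       # term -> doc ids added since the last merge
        self.aux_docs = 0
        self.merge_threshold = merge_threshold  # assumed trigger: number of new docs

    def add_document(self, doc_id, text):
        """Index a new document in the auxiliary index only."""
        for term in set(text.lower().split()):
            self.auxiliary.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Fold the auxiliary postings into the main index and reset it."""
        for term, doc_ids in self.auxiliary.items():
            merged = set(self.main.get(term, [])) | set(doc_ids)
            self.main[term] = sorted(merged)
        self.auxiliary = {}
        self.aux_docs = 0

    def search(self, term):
        """A query must consult both indexes so that new documents are visible."""
        hits = set(self.main.get(term, [])) | set(self.auxiliary.get(term, []))
        return sorted(hits)

idx = DynamicIndex(merge_threshold=2)
idx.add_document(1, "web search")
first_lookup = idx.search("web")
main_before_merge = dict(idx.main)
idx.add_document(2, "web index")
print(idx.search("web"))
```

New documents become searchable immediately, while the expensive merge into the main index is deferred and batched.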

Definitions

  - Web graph: each page is a node, and links are directed edges from one node to another
  - Out-links (out-degree) of A: links from page A to other pages
  - In-links (in-degree) of A: links from other pages to A
  - Sink: a page with no out-links
  - Source: a page with no in-links
  - Static page: a page generated prior to any request
  - Dynamic page: a page generated as the result of a request
  - Hidden/deep web: pages with no links pointing to them, password-protected pages, or pages reachable only via a form
  - Indexable Web: union of pages indexed by major search engines

Evaluation of Web Search Engines: High Precision Search

  - Traditional IR systems are evaluated based on precision and recall.
  - Web search engines are evaluated based on the top N documents.
  - Recall estimation is very difficult.
  - Precision is of limited concern, as many users do not look beyond the first screen.
  => How fast and how accurately is the first results screen generated?

Web Page Ranking

Considering both query-dependent and query-independent scores (captured during indexing), a global score is generated for each page:
  - Query-dependent score
      - Similarity measures such as cosine, BM25, proximity, ...
  - Query-independent score
      - Link analysis (anchor text, popularity metrics such as authorities and hubs, page rank, ...)
      - Sponsored search
      - Localized search
      - Query log analysis
      - etc.

Query Log Analysis

Using user query patterns by day and by time of day, week, month, and year, many optimizations are possible:
  - Pre-cache likely Web pages in anticipation of user queries to reduce page access delays, increasing system throughput (efficiency optimization)
  - Adjust relevance ranking to tune for certain user queries (accuracy optimization)

Anchor Text

  - Short (2-3 terms); describes the linked/destination page
  - May or may not express a different point of view than the author's
  - The anchor text of links to a document d_i is included in the index for d_i
  - Extended anchor text (text surrounding the anchor text) may also be used
  - Generally weighted based on frequency (notion of idf)
  - Spamming problem

Page Rank

  - A scoring mechanism in Web search (trademarked by Google and patented by Stanford)
  - Generally calculated at the time of crawling
  - Adjusts a Web page's score, using incoming and outgoing links as an indicator of popularity
  - A popular page is defined as a page that:
      - Many Web pages link to (in-links)
      - Important (popular) pages link to
  - May be affected by link spam

Page Rank

\[
\mathrm{PageRank}(A) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{\mathrm{PageRank}(D_i)}{C(D_i)}
\]

  - D_1 ... D_n: pages that link to page A
  - C(D_i): number of links out from page D_i
  - d: damping factor (from 0 to 1; commonly 0.85)
  - N: total number of pages

An iterative algorithm:
  - Initially all pages are assigned an arbitrary page rank (1/N), summing to 1
  - Iteratively calculate the scores until the new scores do not change significantly
  - To converge faster, page ranks may be initialized based on the number of in-links, log information, ...

Authorities and Hub

Various algorithms are based on assigning each retrieved Web page two scores: authority and hub scores (HITS: Hyperlink-Induced Topic Search, 1999).
  - Authority page: an authoritative source on a given topic
  - Hub page: a page listing pointers to authority pages on a topic
  - Authority score: sum of the scores of all the hubs pointing to that authority page
  - Hub score: sum of the scores of all the authority pages the hub points to
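The iterative PageRank computation can be sketched as below, using the (1 - d)/N form so that the scores keep summing to 1. The graph representation (a dictionary from page to list of out-links), the stopping tolerance, and the treatment of sink pages (their rank is spread evenly over all pages) are assumptions of this sketch, not something the slides specify.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PageRank until the total change in scores falls below tol."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # initial ranks sum to 1
    for _ in range(max_iter):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            out = links.get(page, [])
            if out:
                share = d * rank[page] / len(out)   # PageRank(D_i) / C(D_i), damped
                for target in out:
                    new_rank[target] += share
            else:
                # Assumed sink handling: distribute the sink's rank over all pages.
                for target in pages:
                    new_rank[target] += d * rank[page] / n
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print({page: round(score, 3) for page, score in ranks.items()})
```

In this three-page graph, C collects links from both A and B and ends up with the highest score, while B, linked only from A, ends up with the lowest.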

Computing Authority and Hub Scores

  - Retrieve all pages containing the query term t. This is called the root set (~200 pages).
  - Create a set that is the union of the root set pages, the pages that point to root set pages, and the pages that root set pages point to. This is called the base set.
  - Use the base set to compute the hub and authority scores.
  - An iterative algorithm:
      - Initialize all hub and authority scores to 1
      - Update s(h) and s(a)

Sponsored Search

  - Search system vendors sell keywords to advertisers so that whenever such words are issued in a query, a link to the advertiser's desired homepage is returned.
  - Sponsored search results are biased towards advertisers with higher bids, higher click frequency of ads, ...
  - Significant revenue is generated for search engine vendors via this approach (e.g., 50 cents to 15 dollars per click)
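The update step for s(h) and s(a) can be sketched as follows, assuming the base set has already been built and is given as a dictionary from page to the pages it links to. The per-iteration normalization and the fixed iteration count are assumptions added so that scores stay bounded; the slides only state the initialization and the update.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries for the pages in the base set."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Assumed normalization step (unit Euclidean length).
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

base_set = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub_scores, auth_scores = hits(base_set)
print(max(auth_scores, key=auth_scores.get))
```

Here `a1` is pointed to by both hubs and receives the highest authority score, and `h1`, which points to both authorities, receives the highest hub score.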

Sponsored Search

  - Search engines maintain an advertisement database (description of the advertisement, link to its page, bids, popularity, ...)
  - The advertisement database is searched for a match to:
      - Query terms
      - Keywords extracted from the retrieved result page (pseudo-relevance feedback, page features, ...)
  - Advertisements are ranked based on bids (on keywords) and advertisement popularity (using clickthrough data logs)

Localized Search

Geographic information is used to modify the ranking of results (in addition to SC scores, link-based scores, ...). Geographic information may be derived from:
  - Location of the device sending the query
  - Context of the query
      - "restaurant near Al Capone's hometown"
      - "restaurant near White Sox stadium"
  - Geographic location in the query
      - "Chicago restaurants"
  - Geographic location in a document's metadata
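A toy version of advertisement matching and ranking is sketched below. The slides say only that ranking combines bids and popularity from clickthrough logs; multiplying the bid by the clickthrough rate is one common way to combine them and is an assumption here, as are the record fields (`link`, `keywords`, `bid`, `clicks`, `impressions`) and the example links.

```python
def rank_ads(ads, query_terms):
    """Return links of ads whose keywords match the query, best first."""
    query = {term.lower() for term in query_terms}
    matches = []
    for ad in ads:
        if query & set(ad["keywords"]):        # keywords assumed to be lowercase
            shown = ad["impressions"]
            ctr = ad["clicks"] / shown if shown else 0.0
            matches.append((ad["bid"] * ctr, ad["link"]))  # assumed scoring rule
    matches.sort(reverse=True)
    return [link for _, link in matches]

ads = [
    {"link": "a.example", "keywords": {"pizza"}, "bid": 2.0,
     "clicks": 10, "impressions": 100},
    {"link": "b.example", "keywords": {"pizza", "chicago"}, "bid": 1.0,
     "clicks": 50, "impressions": 100},
    {"link": "c.example", "keywords": {"shoes"}, "bid": 5.0,
     "clicks": 90, "impressions": 100},
]
print(rank_ads(ads, ["Chicago", "pizza"]))
```

The lower bid wins in this example because its ad is clicked far more often, which shows why popularity is part of the ranking and not bids alone.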

Result Snippets

Provide users with a short summary of each page (title, URL, link to cached page, snippet).
  - Static snippets
      - Query independent
      - Created at indexing time and cached
      - Contain the title and n sentences/words, ... (NLP can be used)
  - Dynamic snippets
      - Query dependent
      - Created at the time of results scoring
      - Windows of the document, also called KWIC (keyword in context)

Result Snippets

The index maintains sentence-level information. Snippet sentences can be picked:
  - Based on query term(s):
      - Heading
      - Location in the document (nth sentence)
      - Closeness of query terms in the sentence
      - Ratio of query terms in the sentence
      - Unique query terms in the sentence
  - From page metadata
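A dynamic snippet selector using two of the criteria above (unique query terms in a sentence, then ratio of query terms) might look like the sketch below. The regular-expression sentence splitter, the tie-breaking by position, and the function name `dynamic_snippet` are assumptions for illustration; a real engine would read sentence boundaries from the index.

```python
import re

def dynamic_snippet(text, query_terms, max_sentences=1):
    """Pick the sentences that best cover the query, in document order."""
    query = {term.lower() for term in query_terms}
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        unique_hits = len(query & set(words))
        ratio = sum(w in query for w in words) / len(words) if words else 0.0
        # Negated scores so that an ascending sort puts the best sentence first.
        scored.append((-unique_hits, -ratio, position, sentence))
    scored.sort()
    best = sorted(scored[:max_sentences], key=lambda item: item[2])
    return " ".join(item[3] for item in best)

page = ("Search engines crawl the Web. PageRank scores pages by links. "
        "Snippets summarize pages.")
print(dynamic_snippet(page, ["pagerank", "links"]))
```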

Result Snippets

An effective snippet should (per the clickthrough analysis of Clarke et al., 2007):
  - Have all the query terms (unless already included in the title)
  - Use the page metadata, if needed
  - Display the URL and mark the query terms
  - Provide a meaningful snippet rather than only some keywords

Social Search

  - Social search introduces new aspects to search engines
  - Village paradigm (collaborative) [Horowitz & Kamvar, WWW '10]
      - Crowd/social network/friends vs. corpus-based
      - Routing questions to potential answerers
  - A community of users, sharing a goal or interest, participate in search and interact with each other online
      - YouTube, Twitter, Flickr, Facebook, Myspace, LinkedIn, forums, blogs, online games, ...
  - From Wikipedia: "Social search or a social search engine is a type of web search that takes into account the Social Graph of the person initiating the search query. When applied to web search this Social-Graph approach to relevance is in contrast to established algorithmic or machine-based approaches where relevance is determined by analyzing the text of each document or the link structure of the documents."

Real-Time Search

  - Traditional search indexes the crawled pages
  - Real-time search results of search engines such as Google, Bing, and Yahoo come from a variety of real-time services such as Twitter, Flickr, YouTube, etc.
  - Data is received directly from various social media and blogs (by subscribing to social networking sites)
  - A filtering engine identifies spam
  - Measuring relevance: the ranking is based on time, relevance to the query, number of followers of the author, reputation of a link defined by the frequency of forwarding (re-tweets), ...
  - First real-time search: Summize in 2007, with real-time trend analysis (later merged with Twitter, 2008)

Social Search

  - Documents or websites are deemed relevant if the searcher's social network was also interested in them.
  - Nature of queries
      - In many cases opinionated, subjective
      - Query length (in many cases longer than Web queries)
  - Index
      - Stores users' behavior (responsiveness, answer quality, expertise)
      - Maps users to topics

Social Search Ranking

Ranking is based on a combination of:
  - Query-dependent factors (probability of a good answer to query q by user u)
      - Similarity of results to the query (various rankings: cosine, BM25, proximity, ...)
      - Relatedness of the query/results to the user
  - Query-independent factors
      - How many users bookmarked x
      - Social trust
      - Similarity of asker to answerer: user profile similarity, user connectedness

Social Search

Mapping users to topics. An example: [Horowitz & Kamvar, WWW '10]
  - The user specifies interest/expertise in topics
  - Friends of the user indicate the expertise of user u in topics
  - Topics are automatically identified from:
      - The user's existing online profiles
      - The user's homepages, blogs
      - The user's status messages (Twitter, Facebook, IM, ...)

Social Search

Measuring connectedness using cosine similarity over various features, such as: [Horowitz & Kamvar, WWW '10]
  - Social connection (common friends and affiliations)
  - Demographic similarity
  - Profile similarity (e.g., common favorite movies)
  - Vocabulary match (e.g., IM shortcuts)
  - Chattiness match (frequency of follow-up messages)
  - Verbosity match (the average length of messages)
  - Politeness match (e.g., use of "Thanks!")
  - Speed match (responsiveness to other users)
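Cosine similarity over per-user feature vectors can be sketched as below. How each feature is turned into a number is not described in the slides, so the feature names and values here are purely hypothetical; only the cosine computation itself is standard.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse feature vectors (dicts)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                      # no features: treat as unrelated
    return dot / (norm_u * norm_v)

# Hypothetical feature values for an asker and two candidate answerers.
asker = {"common_friends": 3.0, "verbosity": 1.0, "politeness": 1.0}
close_user = {"common_friends": 6.0, "verbosity": 2.0, "politeness": 2.0}
distant_user = {"chattiness": 4.0}
print(cosine_similarity(asker, close_user), cosine_similarity(asker, distant_user))
```

Because cosine similarity ignores vector length, `close_user` scores 1.0 despite having larger raw values; only the relative mix of features matters.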

Social Search

Sample approach: [Karweg et al., CIKM '11]
  - The Social Relevance Score (SRS) ranks the result elements of a query according to their social relevance for the user. It is calculated based on two factors:
      - Engagement intensity: how intensely users interacted with the result
          - Engagement: interaction in terms of recommendations, ratings, status messages
          - Intensity: effort of textual feedback vs. rating score/thumbs up
      - Trust score: level of trust in those who recommend a link
          - Assigned by users & refined by social network analysis using PageRank on the social graph
  - SRS(i): social relevance score of document/page i
  - x: a user in the social network who interacted with/recommended page i; E_i is the set of such users
  - t_s(x): trust score of searcher s in user x; e_x(i): engagement intensity of user x with page i

\[
\mathrm{SRS}(i) = \sum_{x \in E_i} t_s(x) \cdot e_x(i)
\]

Social Search: Trust

  - Trust has been discussed for years in sociology and social psychology
  - [Marsh, Ph.D. dissertation, 1994] formalized trust as a computational concept (agents that keep a history of behaviors)
  - Trust in peer-to-peer networks: EigenTrust [Kamvar et al., 2004] (corrupt vs. valid files)
  - Various efforts to formalize trust in recommender systems and social networks [Swearingen and Sinha, 2001], [Ziegler and Golbeck, 2006]
  - The more similar two people are, the greater the trust between them [Ziegler and Golbeck, 2006]
  - Trust in a person is a commitment to an action based on a belief that the future actions of that person will lead to a good outcome.
  - Example: Alice trusts Bob if she chooses to read a message (commits to an action) that Bob sends her (based on her belief that Bob will not waste her time).
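Read as a trust-weighted sum of engagement intensities, the Social Relevance Score can be computed as in the sketch below. This follows the formula as reconstructed from the slide, which is itself an interpretation; the numeric scales for trust and engagement, the default trust of 0 for unknown users, and the function name are assumptions.

```python
def social_relevance_score(engagements, trust):
    """Sum trust(x) * engagement(x, i) over the users x who engaged with page i.

    engagements: user -> engagement intensity with the page
    trust: user -> the searcher's trust score for that user
    """
    return sum(trust.get(user, 0.0) * intensity
               for user, intensity in engagements.items())

# Hypothetical values: Alice wrote a review (high effort), Bob gave a thumbs up.
engagements_with_page = {"alice": 1.0, "bob": 0.5}
searcher_trust = {"alice": 0.8, "bob": 0.4}
print(social_relevance_score(engagements_with_page, searcher_trust))
```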

Tagging

  - Social media sites allow users to tag the data
  - User tags act as manual indexing of data, in addition to automatic indexing
  - User tags serve as a folksonomy
  - Tags are used to organize and search data
  - Challenges with tagged data:
      - Vocabulary mismatch
      - Noisy or spam tags
      - Missing tags

Searching Tagged Data: Vocabulary Mismatch Problem

  - Tag keywords describe textual or non-textual data and are used to search for items
  - Tags are very sparse (only a few keywords)
  - Boolean (conjunctive, disjunctive) search can lead to high precision/low recall or high recall/low precision
  - To reduce the vocabulary mismatch, perform stemming or pseudo-relevance feedback

Searching Tagged Data: Noisy and Spam Tags

  - Spam, misspelled, or non-relevant tags mislead search
  - Some incentive must be provided to users to report spam tags and to enter good-quality tags
  - Log and statistical information may help to identify spam tags

Searching Tagged Data: Missing Tags

Automatically generate tags for items with missing tags, using:
  - Term weights of the textual representation of the item
  - Classification of the item to a label (i.e., a tag)

Tag Clouds

  - The most popular tags are presented to users to provide a wider view of the collection
  - A tag cloud displays the tags as a weighted list
  - The font size is proportional to the weight

Thanks to: tagcloud generator & F. Silvestri, CNR, Italy; S. Orlando, U. of Venice, Italy

Recommender Systems
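The weight-to-font-size mapping of a tag cloud can be sketched as follows. Using the tag's usage count as its weight, scaling linearly between a minimum and maximum point size, and the specific sizes 10 and 32 are all assumptions chosen for illustration.

```python
def tag_cloud_sizes(tag_counts, min_size=10, max_size=32):
    """Map each tag's count to a font size scaled between min_size and max_size."""
    if not tag_counts:
        return {}
    lowest = min(tag_counts.values())
    highest = max(tag_counts.values())
    span = highest - lowest
    sizes = {}
    for tag, count in tag_counts.items():
        # If every tag has the same weight, show them all at the largest size.
        scale = (count - lowest) / span if span else 1.0
        sizes[tag] = round(min_size + scale * (max_size - min_size))
    return sizes

counts = {"web": 10, "search": 5, "tag": 0}
print(tag_cloud_sizes(counts))
```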


More information

A Survey on Web Information Retrieval Technologies

A Survey on Web Information Retrieval Technologies A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New York, Stony Brook Presented by Kajal Miyan Michigan State University Overview Web Information

More information

Module 1: Internet Basics for Web Development (II)

Module 1: Internet Basics for Web Development (II) INTERNET & WEB APPLICATION DEVELOPMENT SWE 444 Fall Semester 2008-2009 (081) Module 1: Internet Basics for Web Development (II) Dr. El-Sayed El-Alfy Computer Science Department King Fahd University of

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis Content Anchor text Link analysis for ranking Pagerank and variants HITS The Web as a Directed Graph Page A Anchor

More information

Introduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline

Introduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline Introduction to Information Retrieval (COSC 488) Spring 2012 Nazli Goharian nazli@cs.georgetown.edu Course Outline Introduction Retrieval Strategies (Models) Retrieval Utilities Evaluation Indexing Efficiency

More information

Query Refinement and Search Result Presentation

Query Refinement and Search Result Presentation Query Refinement and Search Result Presentation (Short) Queries & Information Needs A query can be a poor representation of the information need Short queries are often used in search engines due to the

More information

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? In our experience, we find we can get over-excited when talking to clients or family or friends and sometimes we forget that not everyone

More information

Advertising Network Affiliate Marketing Algorithm Analytics Auto responder autoresponder Backlinks Blog

Advertising Network Affiliate Marketing Algorithm Analytics Auto responder autoresponder Backlinks Blog Advertising Network A group of websites where one advertiser controls all or a portion of the ads for all sites. A common example is the Google Search Network, which includes AOL, Amazon,Ask.com (formerly

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection

More information

SEO and Monetizing The Content. Digital 2011 March 30 th Thinking on a different level

SEO and Monetizing The Content. Digital 2011 March 30 th Thinking on a different level SEO and Monetizing The Content Digital 2011 March 30 th 2011 Getting Found and Making the Most of It 1. Researching target Audience (Keywords) 2. On-Page Optimisation (Content) 3. Titles and Meta Tags

More information

Searching the Web for Information

Searching the Web for Information Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

Digital Marketing Overview of Digital Marketing Website Creation Search Engine Optimization What is Google Page Rank?

Digital Marketing Overview of Digital Marketing Website Creation Search Engine Optimization What is Google Page Rank? Digital Marketing Overview of Digital Marketing What is marketing and digital marketing? Understanding Marketing and Digital Marketing Process? Website Creation Understanding about Internet, websites,

More information

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

Graph and Link Mining

Graph and Link Mining Graph and Link Mining Graphs - Basics A graph is a powerful abstraction for modeling entities and their pairwise relationships. G = (V,E) Set of nodes V = v,, v 5 Set of edges E = { v, v 2, v 4, v 5 }

More information

Marketing & Back Office Management

Marketing & Back Office Management Marketing & Back Office Management Menu Management Add, Edit, Delete Menu Gallery Management Add, Edit, Delete Images Banner Management Update the banner image/background image in web ordering Online Data

More information

Efficient query processing

Efficient query processing Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:

More information

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University CS6200 Information Retrieval Jesse Anderton College of Computer and Information Science Northeastern University Major Contributors Gerard Salton! Vector Space Model Indexing Relevance Feedback SMART Karen

More information

Why it Really Matters to RESNET Members

Why it Really Matters to RESNET Members Welcome to SEO 101 Why it Really Matters to RESNET Members Presented by Fourth Dimension at the 2013 RESNET Conference 1. 2. 3. Why you need SEO How search engines work How people use search engines

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

Digital Marketing for Small Businesses. Amandine - The Marketing Cookie

Digital Marketing for Small Businesses. Amandine - The Marketing Cookie Digital Marketing for Small Businesses Amandine - The Marketing Cookie Search Engine Optimisation What is SEO? SEO stands for Search Engine Optimisation. Definition: SEO is a methodology of strategies,

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

Welcome to the class of Web Information Retrieval!

Welcome to the class of Web Information Retrieval! Welcome to the class of Web Information Retrieval! Tee Time Topic Augmented Reality and Google Glass By Ali Abbasi Challenges in Web Search Engines Min ZHANG z-m@tsinghua.edu.cn April 13, 2012 Challenges

More information

CS/INFO 1305 Summer 2009

CS/INFO 1305 Summer 2009 Information Retrieval Information Retrieval (Search) IR Search Using a computer to find relevant pieces of information Text search Idea popularized in the article As We May Think by Vannevar Bush in 1945

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

DIGITAL MARKETING For your Company

DIGITAL MARKETING For your Company DIGITAL MARKETING For your Company www.almada.co 1 About Us Established in 1998 with 8 developer team and 42 offshore team, a PCI DSS, ISO 27001, 9001 certified Data Center & service provider, a world-leading

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

21. Search Models and UIs for IR

21. Search Models and UIs for IR 21. Search Models and UIs for IR INFO 202-10 November 2008 Bob Glushko Plan for Today's Lecture The "Classical" Model of Search and the "Classical" UI for IR Web-based Search Best practices for UIs in

More information

Gary Viray Founder, Search Opt Media Inc. Search.Rank.Convert.

Gary Viray Founder, Search Opt Media Inc. Search.Rank.Convert. SEARCH + SOCIAL Gary Viray Founder, Search Opt Media Inc. Goo gol Google Algorithm Change Google Toolbar December 2000 Birth of Toolbar Pagerank They move the toilet mid stream. 404P Pages are ranking

More information

SEO and UAEX.EDU GETTING YOUR WEB PAGES FOUND IN GOOGLE

SEO and UAEX.EDU GETTING YOUR WEB PAGES FOUND IN GOOGLE SEO and UAEX.EDU GETTING YOUR WEB PAGES FOUND IN GOOGLE What is Search Engine Optimization? SEO is a marketing discipline focused on growing visibility in organic (non-paid) search engine results. Why

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document

More information

ITP 140 Mobile Technologies. Mobile Topics

ITP 140 Mobile Technologies. Mobile Topics ITP 140 Mobile Technologies Mobile Topics Topics Analytics APIs RESTful Facebook Twitter Google Cloud Web Hosting 2 Reach We need users! The number of users who try our apps Retention The number of users

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

6 WAYS Google s First Page

6 WAYS Google s First Page 6 WAYS TO Google s First Page FREE EBOOK 2 CONTENTS 03 Intro 06 Search Engine Optimization 08 Search Engine Marketing 10 Start a Business Blog 12 Get Listed on Google Maps 15 Create Online Directory Listing

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Course overview Instructor: Rada Mihalcea What is this course about? Processing Indexing Retrieving textual data (or audio, video, geo-spatial,, data) Fits in four

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Intro to Peer-to-Peer Search

Intro to Peer-to-Peer Search Intro to Peer-to-Peer Search (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Outline Peer-to-peer historical perspective Problem definition Local client data processing Ranking functions Metadata copying

More information