doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

Size: px

Start display at page:

Download "doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague"

Branden Ray
5 years ago
Views:

1 Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

2 Searching the Web and Multimedia database systems computer graphics, computer vision! information retrieval Databases T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 2

3 it is about retrieving information from web, searching, browsing, querying web pages (text, hyper-text, social networks) multimedia documents/objects (content-based search) it is not about web application architectures user interfaces not related to search networking, protocols, and other low-level infrastructure stuff T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 3

4 the web space data, multimedia, and communities on the web web search engines history web crawling, indexing, searching multimedia retrieval brief overview of web retrieval modalities browsing, queries, filtering, social networks T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 4

5 World Wide Web (WWW) founded by Tim Berners-Lee (CERN) in 1989 an internet graph of web pages + other resources hosted on web servers communication over the Internet using HTTP (hyper-text transfer protocol) GUI-based internet space T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 5

6 web page hyper-text document written in HTML (hyper-text markup language) or descendants, like XHTML, PHP text including links to resources using URL (uniform resource locator) either other web page, or resource on the internet (multimedia object, flash application, etc.) depending on HTML command, the URL source could be shown as link, or downloaded (and embedded) within the referencing web page node in the Web graph (URL links are the out-edges from that node) web site web subgraph of related web pages, e.g., personal web, newspapers, etc. T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 6

7 Web (1.0) the first 15 years of WWW personal websites, publishing, rather static content passive browsing dominant Web 2.0 the WWW era since 2004 advancements due the increased availability of high-speed internet connection social networks, blogs, site aggregations, information society participation instead of passive browsing, interactivity other devices than PCs (general electronics, mobile phones, etc.) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 7

8 more than 95% (99.9%?) of web space is considered to store multimedia content 100 billions of photographs are expected to be taken every year factors high-speed internet, increasing computational power cheap digital devices camera, camcoders, all-in-one devices (PDA, smart phones) everyone is producer human activities move to internet in a large extent social networks industry (banking, retail, services,...) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 8

9 database data-centric concept for general retrieval and sharing of homogenous collections of data (relational, text, multimedia) fixed structure (limited annotation) simpler implementation web database with benefits integrates various contents, presentation and communication techniques heterogeneous collection of data (mixed data types) rich annotation possibilities, (social) context T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 9

10 real life moves into the Web at a large scale entertainment FaceBook, YouTube, Flickr, news work various cloud computing applications (webmail, GoogleDocs) shopping e-commerce study e-learning, student information systems state administration services e-government in all the mentioned Web activities, the essential is: retrieval of an information (about entity, person) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 10

11 Web search engine the major mean of the web information retrieval the web search engine should provide at least keyword-based (full-text) search additional information for ranking acquired from the Web graph (e.g., PageRank, HITS) meta-search engine engine that only aggregates results returned by several other web search engines could be more effective (e.g., good for retail business competition (e.g., T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 11

12 information retrieval (since 1960s) models and indexing techniques for full-text search older than Web, i.e., not designed for web retrieval, rather digital libraries/archives, collections of full-text documents three basic models boolean, vector, probabilistic with popularity of Web, it included also the link analysis (1998) web retrieval prior to search engines only web directories edited by humans hard to find any web page not listed! e.g., central list of web servers hosted on a CERN web server T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 12

13 the pioneers ( ) Archie search in downloaded directory listings of public FTPs (1990) W3Catalog first search engine in catalogues (manually maintained) World Wide Web Wanderer first robot (crawler) followed by AltaVista, Inktomi, Yahoo! (though only directory search) the Google era (since 1998) PageRank algorithm gave much better results than existing engines based just on the full-text search (classic information retrieval models) success also due to simple GUI, thus fast download in the slow-speed internet age (not downloading Ads and unnecessary portal GUI) followed by Bing, Yahoo! Search but Google still has over 90% of the market T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 13

14 typical elements of a web search process crawling downloading the content (web pages) indexing processing the content into a form suitable for search searching retrieving the relevant content using a query web Crawler module Page repository user Indexing module Indexes content idx Query module Ranking module special idx structure idx T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 14

15 crawler = a short program that instructs spiders (or robots/bots) on how and which pages to retrieve robot/bot/spider = autonomous agent that visits URLs a root set of URLs is given to the spider the spider visits pages on that URLs + their links to a certain depth because of the Web size, crawler must carefully select which pages to visit e.g., only Czech pages, or just.edu pages crawler must observe which pages change frequently repeated visits of the spiders (each day/week/month) prioritized visits, e.g., using PageRank (limited resources) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 15

16 ethic crawling spider consumes bandwidth and hit quotas of the hosting server and other Internet infrastructure polite spiders minimize their impact (like outdoor hiker that tries to leave no trace in the nature) Robot Exclusion Protocol defines proper spider activities, punishing obnoxious spiders admins can use a robots.txt file to block spiders accessing the server optimal crawling policy needed to ensure web pages are not visited multiple times, thus minimizing the communication overhead T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 16

17 the spiders return with a set of URLs for new/refreshed pages that need to be added/updated in the search engine s indexes some authors worry that spider might never find their website they can check web server s logs every web search engines has a Submit-Your-Site mechanism how to add such a page on the list, e.g., Google offers a submission tool at other resources on web crawlers book Spidering Hacks offers tricks and tips for crawler implementation book Numerical Computing with Matlab contains an example of crawler written in Matlab, see file surfer.m at T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 17

18 the web pages returned by spiders need to be organized in a form suitable for searching, the whole collection of downloaded web pages = a database of full-texts (or the corpus) naively, the search engine would scan every document in the corpus for each query submitted by a user as this is inefficient, there is a need for an index, allowing to process only a small fraction of the corpus for a query many various indexes content index allowing the full-text search efficiently structure/citation index allowing to take page ranking into account special purpose index, e.g., for content-based multimedia search T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 18

19 index design factors merge factors check if new is added or old is updated, how to merge two indexes created asynchronously in parallel storage techniques how to store/compress/filter data index size how much computer storage is required lookup speed how quickly a word can be found in an index maintenance fault tolerance how important it is for the service to be reliable (dealing with index corruption, bad HW, replication, etc.) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 19

20 based on the retrieval model, the indexes are implemented by various data structures inverted file hash table or tree storing occurrences of pages for each search term/predicate (used in Boolean and vector models) signature files/trees storing a binary vector for each page, where 1 means an occurrence of a term in the web page (Boolean model) suffix tree/array trie-like structure for efficient retrieval of words based on storing their suffixes the term-by-document matric a logical view of the corpus as a large sparse matrix (storing weight of each term in each web page) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 20

21 apart from indexes, the web search engines could store also the entire web pages in caches (even with multimedia content), to provide complete result even if the page is no longer available at the original URL Internet archive project an attempt of caching the whole Internet (starting in 1996) including the history of the pages (one can find old versions of websites) nowadays (Feb 2011) 150 billions web pages archived Waybackmachine T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 21

22 traditional web search engines full-text indexing + link analysis keyword or full-text queries keyword query (small) set of terms (words, phrases) full-text query text file (web page), parsed to keyword query multimedia web search engines content-based queries (in addition to keyword search) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 22

23 multimedia content raw content (e.g., considering pixels of an image) feature extraction needed vector/set of low-level features could be automated multimedia annotation high-level semantics (e.g., image of a ship) keywords/tags, text, URL, categories, GPS, context of social network human-annotated, (semi-)automated annotation T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 23

24 Example: a photograph hosted at Flickr raw content manual annotation automated annotation T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 24

25 query browsing filtering (recommendation) combinations query + browsing browsing + query filtering + query etc. T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 25

26 assumption we are able to specify our search intent query = explicit formulation of one-shot search intent query models keyword-based (annotation based) content-based file upload, URL, sketch query result binary relevance unordered set of objects multi-value relevance ranking on database objects relevance feedback, re-ranking T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 26

27 keyword query multiple answers web pages multimedia other T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 27

28 content-based query image search, query by sketch example T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 28

29 assumption we are not able to (well) specify our search intent browsing = iterative navigation in the database browsing models explicit graph (links between entities) virtual graph (series of queries) browsing result subset of database objects set of clusters (representatives) hierarchy (ontology) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 29

30 browsing explicit graph static hierarchy of categories dynamic network browsing virtual graph series of queries an object in query result is used to specify the new query T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 30

similar pages/objects query by keywords obtained from the annotation

31 (credit: dreamstime.com) explicit graph links in user s portfolio implicit graph links to similar pages/objects query by keywords obtained from the annotation weighted keywords weight = font size T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 31

32 filtering = formulation of fixed search intent explicit = static request query, subscription inexplicit = user preferences, profile, search history, collaborative filtering, etc. context of social networks filtering result dynamic, changing in time like database view T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 32

33 RSS channel + reader filtering/recommendation tool keyword query (credit: fotolia.com) T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 33

34 subscription to a channel membership in a group etc. T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 34

35 (credit: youtube.com) implicit filtering as an automatic query based on user s previous searches history of viewing content serves as relevance feedback T. Skopal (FIT CTU) Web space, search engines, web retrieval modalities (BI-VWM, Lect. 1) 35

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of