Web Search: An Application of Information Retrieval Theory


Term Project, Summer 2009

Introduction

The goal of the project is to produce a limited-scale but functional search engine. The search engine should be able to provide a list of relevant documents when a query is given, just as any commercial search engine would. It is limited in scale in that it is required to collect only a limited number of documents (e.g. on the order of a few hundred to a few thousand). The more documents your search engine can collect, the better.

This is a multi-phase team project. It will start at the beginning of the semester and last through the semester. The detailed scope of the project, the team organization, technical information, and other details will be given as the semester progresses. An overview and the first part of the project are given here.

Components of a Search Engine

A search engine consists of a collection of software components that work together to collect and analyze a large number of documents over the Internet and to give the user a list of relevant documents and URLs when a query is issued to the search engine. Major components of a typical search engine include the following:

- a user interface, which takes the user input, passes the query to the back-end engine, and displays the results sent back by the engine;
- a crawler, which visits the Web and collects information about all the documents it encounters;
- an indexer, which indexes each of the pages collected from the Web by the crawler and establishes links between keywords and the documents that contain these keywords;
- a ranker/retriever, which ranks the documents for a given query according to certain measures and retrieves the top relevant documents for the user;
- a back-end engine, which takes care of network and file operations.

Figure 1 indicates the relation among different components in a typical search engine.
[Figure 1: Components of a Typical Search Engine (Web, User Interface, Engine, Document Collection, Crawler, Indexer, Retriever/Ranker)]

A search engine consists of two major parts, somewhat independent of each other, as can be seen from the figure. One is on the left side of the document collection and answers the user's queries. The other is on the right side of the document collection and collects information from the Web so that the URLs related to the user queries can be retrieved.

A crawler goes around the Web to read Web pages and to extract information about each page it reads. This information is then sent to an indexer. The indexer takes this information and creates the links between keywords and the documents that contain these keywords. The result is typically saved into a file or a collection of files. When a user issues a query, the document list is searched and a collection of relevant documents is generated. The ranker is responsible for ranking these documents according to certain algorithms and measures. The top-ranked documents are returned to the user for review. At this point the user may review the documents and send feedback to the search engine. The ranker may take this feedback into account and re-select or re-rank the documents for the user to view.

Project Team

The project will be carried out in teams. The details of the team work are given in a separate handout.

Your Work in Phase One and Some Technical Details

Your phase one work is to implement a basic version of the interface and the back-end engine. This is just a framework. As the project progresses, some of these components will need to be enhanced.

The interface part is responsible for the following main tasks.

- Display a main search engine page (you should design this page carefully; it will be the gateway of your search engine to the world).
- Read the user inputs from the browser.
- Display the results generated by the search engine.

The back-end engine takes care of file and network operations. It does the following.

- Maintain the communication between the browser and the search engine.
- Read and parse the inputs from the browser.
- Retrieve and send the results of the query to the browser.

There are some code examples in the directory <http://localhost:9999/webpages/code/javaserver/> for this part of the project. Students who use C++ may find a similar example in C in the directory <http://localhost:9999/webpages/code/cserver/>. You should work through these examples before you design and develop your own program. Do the following.

- Download these program examples and save them into a directory of your own, e.g. WebServer.
- Compile the programs by

      javac EasyWebServer.java

- Run the program by

      java EasyWebServer port#

  where port# is an integer of your choice. There are two restrictions on this number. First, it has to be greater than 1024. Port numbers below 1024 are reserved for system applications (for example, the email server at port 25, the telnet server at port 23, and the ftp server at port 21). Second, the port number has to be unique: there cannot be two applications running at the same port on the same machine.

Now that your simple Web server is running, point your favorite Web browser at

    http://server:port/

to see the results generated by the server. Here server is the name of the computer where your EasyWebServer is running, e.g. localhost, and port is the port number you chose in the previous step. For example, if the Web server is running on localhost at port 9999, then you should point your Web browser at

    http://localhost:9999/

Try a few different paths on the Web server, such as

    http://localhost:9999/sample.html
    http://localhost:9999/search

Then read the programs and make sure you understand what they are doing.

Let's discuss some technical details as we walk through an example. Assume the search engine is running on host localhost at port 9999. We have entered the URL

    http://localhost:9999/search

in our browser. An HTML form will be displayed in the browser as the result.
We typed 123 in the first input box and abc in the second input box, and then we clicked the Submit button. Let's now examine what happens between the browser and the server. When a Web browser contacts a Web server, it sends, among other things, a command of the following form to the server:

    GET /path HTTP/1.1\r\n\r\n

This means the browser is requesting (GET) a page specified by /path, and the protocol that the browser is using is HTTP 1.1. HTTP method names are case-sensitive, so the command is written in upper case. The command must end with two consecutive line terminators, each of which consists of a carriage-return character (\r) followed by a new-line character (\n). In our example, the command sent to the server from the browser is

    GET /search HTTP/1.1\r\n\r\n

This results in the form being displayed on the browser's screen.

The server has to parse this command to understand what the browser wants to do. A GET command indicates that the browser is requesting a Web page. A POST command indicates that the browser is sending some information to the server, for example a search query. When it posts a form, the browser also sends the length of the input to the server. This length tells the server how many bytes the browser is sending as content, and the server then expects to read exactly that number of bytes. In our example, when we fill in the form and click the Submit button, the browser sends a POST request to the search engine. The content of the form is sent to the server as the actual content after the header information.

The following is a sample string the browser sends to the server when it is requesting a regular Web page. When you run the sample program, you will see this echoed on your server screen.

    GET / HTTP/1.0
    Connection: Keep-Alive
    User-Agent: Mozilla/4.78 [en] (X11; U; SunOS 5.8 sun4u)
    Host: polaris:9999
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: en
    Accept-Charset: iso-8859-1,*,utf-8

The following is a sample string the browser sends to the server when it is posting a form to the server. When you run the sample program, you will see this echoed on your server screen.
    POST /form HTTP/1.0
    Referer: http://polaris:9999/search
    Connection: Keep-Alive
    User-Agent: Mozilla/4.78 [en] (X11; U; SunOS 5.8 sun4u)
    Host: polaris:9999
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: en
    Accept-Charset: iso-8859-1,*,utf-8
    Content-type: application/x-www-form-urlencoded
    Content-length: 44
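The two samples share the same shape: a request line (command, path, protocol), then one header per line, then a blank line. The following is a minimal Java sketch of how a server might pick that shape apart. It is not taken from the EasyWebServer sample; the class name RequestHeadSketch and its fields are hypothetical, and it assumes the whole header part has already been read into a single string.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of the EasyWebServer sample.
class RequestHeadSketch {
    final String method;   // e.g. "GET" or "POST"
    final String path;     // e.g. "/search"
    final String version;  // e.g. "HTTP/1.0"
    // Header names are stored in lower case so lookups ignore case.
    final Map<String, String> headers = new LinkedHashMap<>();

    RequestHeadSketch(String head) {
        // Lines normally end with \r\n; a bare \n is tolerated as well.
        String[] lines = head.split("\r?\n");
        String[] parts = lines[0].trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed request line: " + lines[0]);
        }
        method = parts[0];
        path = parts[1];
        version = parts[2];
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                break; // the blank line ends the header part
            }
            int colon = lines[i].indexOf(':');
            if (colon < 0) {
                continue; // skip anything that is not "Name: value"
            }
            String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = lines[i].substring(colon + 1).trim();
            headers.put(name, value);
        }
    }

    // Number of content bytes announced by the browser, or 0 if none is given.
    int contentLength() {
        String value = headers.get("content-length");
        return value == null ? 0 : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        String head = "POST /form HTTP/1.0\r\n"
                + "Host: polaris:9999\r\n"
                + "Content-type: application/x-www-form-urlencoded\r\n"
                + "Content-length: 44\r\n"
                + "\r\n";
        RequestHeadSketch request = new RequestHeadSketch(head);
        // Prints: POST /form 44
        System.out.println(request.method + " " + request.path + " " + request.contentLength());
    }
}
```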

Note that the last header line of the POST sample, Content-length: 44, indicates that the content is 44 characters long. After the header part shown in that sample, the browser sends the form data to the server in the following format.

    FirstInput=123&SecondInput=abc&Submit=Submit

This string is exactly 44 characters long and represents the form that is sent from the browser. The content length (44) and the content string depend on your input to the form. In our example, we typed 123 in the first input box and abc in the second input box. The browser forms the content string as a sequence of name/value pairs. The name and value in a pair are separated by an equal sign (=), and the pairs are separated by an ampersand (&). In our example, FirstInput, SecondInput, and Submit are the names of the form entries (read form.html to see where they are specified), and 123, abc, and Submit are the values corresponding to these names.

Once the server parses the input form string, it captures all the input values and can then act accordingly. If this is a search engine, the server can pass the value as the query string to the ranker/retriever to retrieve relevant documents. In this phase of the project, you may just echo the query term to indicate that the server understands the query. The actual retrieval operation will be implemented in a later phase.

What to Hand In

Your team needs to hand in the following in the order given. Please staple all pages.

1. A team report for phase one of the project with a cover sheet. The report shouldn't be too long, maybe two to three pages. The report should include the following as a minimum.
   (a) The name of your team (should be the same as your search engine);
   (b) A description of the role and contributions of each team member;
   (c) A summary of the working process, e.g. what the team started with, what the team has accomplished, any problems encountered, how the team solved them, and any thoughts on the project;
   (d) Team meeting minutes during this phase of the work.
2. Source code for the programs and HTML pages.
3. Snapshots of sample runs from a Web browser.
4. Email the instructor a copy of the complete source code in zip or tar format.
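As a supplement to the technical details above, the name/value format of the form content can be decoded along the following lines. This is only a sketch under stated assumptions: the class name FormBodySketch and its methods are hypothetical and do not come from the EasyWebServer sample, the content is assumed to have been read already (using the announced content length), and UTF-8 is assumed when decoding escaped characters, since the browser may encode spaces as + and other characters as %XX under application/x-www-form-urlencoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the EasyWebServer sample.
class FormBodySketch {

    // Splits "name=value&name=value" content into an ordered name -> value map.
    static Map<String, String> parseForm(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return fields;
        }
        for (String pair : body.split("&")) {          // pairs are separated by '&'
            int eq = pair.indexOf('=');                // name and value by '='
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            fields.put(decode(name), decode(value));
        }
        return fields;
    }

    // Undoes the browser's URL encoding ('+' for space, %XX escapes).
    // Assumption: the form was submitted as UTF-8.
    static String decode(String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String body = "FirstInput=123&SecondInput=abc&Submit=Submit";
        System.out.println(body.length());   // 44, matching Content-length
        // Prints: {FirstInput=123, SecondInput=abc, Submit=Submit}
        System.out.println(parseForm(body));
    }
}
```

For phase one, echoing the query back to the browser could be as simple as looking up the relevant entry, e.g. parseForm(body).get("FirstInput"), and writing it into the result page.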