
A SEARCH ENGINE FOR THE SCHOOL OF COMPUTING WEB SITE

Jian Ye Deng
MSc Information Systems
School of Computer Studies
University of Leeds

The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others.

Abstract

This project set out to discuss and review local web site search engines and information retrieval (IR) techniques. A web search engine for the SCS intranet web site was built to support the research.

The primary objectives of the project were as follows:
1. Study and explore various information retrieval techniques for improving effectiveness (optimising precision and recall).
2. Build a web search engine for searching the School of Computer Studies web pages, and apply IR techniques in the search engine.
3. Test and evaluate the searching system.

To assess the impact of the IR techniques as they were incorporated into the implementation, I reviewed and explored the major information retrieval technologies and built a web search engine that adopts various approaches drawn from these techniques. The impact of these techniques was then evaluated through a series of experiments.

The personal objectives of the project were:
1. Extend my theoretical and practical knowledge in the information retrieval field.
2. Gain experience of CGI programming.
3. Produce a fully working product.

The deliverables of this project were:
1. A working SCS web search engine.
2. This project report.

All of the objectives and deliverables of the project have been completed. This project gave me a better understanding of the organisation and time scheduling of projects, as well as of the specific problems which can occur in them.

Acknowledgements

I would like to thank my project supervisor, Dr Stuart Roberts, for his very useful advice and support throughout the project. Moreover, his wife, Mrs Ann Roberts, and their lovely sons made helpful contributions to the evaluation process. Credit should also be given to Dr Nick Efford and Dr Peter Jimack, who gave me useful advice as well. Thanks also go to everyone who has helped during this project. Last but not least, special thanks go to my wife for being supportive throughout this project.

Contents

Abstract
Acknowledgements
Contents

Chapter 1  Introduction
  1.1  Background
  1.2  Local Site Search Components
  1.3  Structure of a Local Site Search Engine
  1.4  System Evaluation

Chapter 2  Modelling Techniques
  2.1  Introduction
       Boolean Model
       Classical Vector Model
       Probabilistic Model
  2.2  Vector and Vector Space Models
       Term Weighting Systems
       Basic Vector Model
       Vector Space Models

Chapter 3  User Interface Design Techniques
  3.1  Natural Language
  3.2  Intelligent or Artificial Agents
  3.3  Visualisation
  3.4  User Interface Design in WWW Search Engine

Chapter 4  System Design
  4.1  Analysing the Current SCS Intranet Web Sites
  4.2  The System Architecture
  4.3  Building the Index File and Ranking Hits
  4.4  User Interface Design
  4.5  Administrator Interface Design

Chapter 5  System Implementation
  5.1  Choosing a Programming Language
  5.2  Finding an Appropriate Template Search Tool
  5.3  Adding a Stemming Algorithm
  5.4  Removing More Fluff Words
  5.5  Improving the User Interface
  5.6  Using the Vector Model

Chapter 6  System Testing, Evaluation and Conclusion
  6.1  Testing Environment
  6.2  Evaluation Criteria
  6.3  Evaluation Process
  6.4  Evaluation Results
  6.5  Conclusions and Future Thoughts

Appendix A
Appendix B
Appendix C
Appendix D  Porter Stemming Algorithm
Appendix E
References

Chapter 1  Introduction

1.1 Background

The World Wide Web is a large, distributed digital information space. It started as an organisation-wide collaborative environment for sharing research documents in nuclear physics at CERN. Nowadays, the web is becoming a universal repository of human knowledge and culture, encompassing diverse information resources including personal web pages, online digital libraries, virtual museums, product and service catalogues, government information for public dissemination, research publications, and Gopher, FTP, Usenet news, and mail servers.

However, finding useful information on the web is frequently a tedious and difficult task. To satisfy an information need, users might navigate the space of web links in search of the expected information. There is a paradox: the more information a site has, the more useful it is, and the harder it is to navigate! Since the whole web space is vast and largely unknown, the navigation task is usually inefficient. For naive users the problem becomes much more difficult, and it might entirely frustrate all their efforts.

As the WWW started to explode in terms of users, servers, and pages, it became obvious that search capabilities had to be added, and a flourishing market of public search engines emerged. The first web-based search engines came into existence in the early 1990s. Regular users of the Internet's World Wide Web are very familiar with the sites that have been developed to allow users to search across all or part of the Web. Examples of these sites include Infoseek, Altavista, Hotbot, etc. They are excellent tools for finding information that is stored across the Web and can assist users in finding information that would otherwise be difficult to locate. Later, the growing number of intranets, i.e. intra-organisational webs (such as the SCS intranet web site) hidden from the Internet behind firewalls or proxies, created a need not only for public web search services but also for internal web search products.

IR is currently maturing, bringing together components from many constituent research communities, each with their own traditions and characteristics. These constituents include mathematical modelling of information using logical and probabilistic approaches, and modelling of the information-seeking process of searchers. These approaches are being added to a strong experimental and hypothesis-testing tradition within IR research, which itself is being augmented by more psychological-style experiments introduced to the computing science community via human-computer interaction and cognitive science research. Nowadays, research in IR includes modelling, document classification and categorisation, system architecture, user interfaces, data visualisation, filtering, languages, etc. Building a local web site search engine involves applying one or more of the above techniques to improve the searching performance.

The objectives of the project are to explore various information retrieval techniques for improving effectiveness and to build a local site web search engine (the SCS web search engine) for searching the Computer Studies local web pages.

1.2 Local Site Search Components

A local site information retrieval system can be divided into several relatively separate components, which can be built and maintained individually. Taking the SCS search engine as an example, it includes the following parts.

The Search Engine is the key component of the system: the program (CGI, server module or separate server) that accepts the request from the form or URL, searches the index, and returns the results page to the server. In our search engine, the CGI programs are written in Perl, an easy and powerful programming language. Perl scripts can be used on most platforms and communicate with most web servers using the CGI standard. A site visitor filling in data and clicking a Search button on an HTML form invokes the site search CGIs.
They take the data from the form as parameters, search for the terms, limit the results according to any other settings, and return the result list as an HTML page. However, there is some overhead in sending the data back and forth, and in some cases the CGI programs can become overwhelmed.

The Search Indexer program creates the Search Index File. This file stores the data from the web site in a special index or database designed for very quick access. Depending on the indexing algorithm and the size of the site, this file can become very large. In the SCS search engine, several text files play this role. These files can be updated through a friendly interface in order to keep them synchronised with the pages and avoid providing obsolete results. On the other hand, creating or updating the index file is a time-consuming task.

Providing an HTML interface - the Search Forms - for visitors to enter their search terms and specify their preferences is a necessary part of any web search engine.

Finally, how to list the web pages which contain text matching the search term(s) is the fourth part that needs to be taken into consideration when designing the search engine. The retrieved hits are sorted in some kind of relevance order, usually based on the number of times the search terms appear according to the particular ranking algorithm and on whether they appear in a title or header. Most results listings include the title of the page and a summary (the Meta Description data, the first few lines of the page, or the most important text). In the SCS search engine, the listing also includes the date modified, file size, and URL.

1.3 Structure of a Local Site Search Engine

Like most web search engines, the SCS search engine is the application which searches the data and returns the results to the client. This means creating an HTML page in the specified format. It searches within an index created by an Indexer application. The users enter their search terms in a text field, and may select appropriate settings in the form. When they submit their queries, the server passes that data to the search engine application.
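
To make this flow concrete, the following is a minimal Perl CGI sketch of such a front end. It is an illustration only, not the actual SCS scripts: the form field names ('terms' and 'type') and the subroutines search_index() and format_hit() are hypothetical placeholders.

#!/usr/bin/perl
# Minimal CGI sketch of a search front end (hypothetical parameter and
# subroutine names, not the actual SCS scripts).
use strict;
use warnings;
use CGI qw(param header escapeHTML);

my $query = param('terms') || '';    # search terms typed by the visitor
my $type  = param('type')  || 'all'; # e.g. all / any / phrase

print header(-type => 'text/html');
print "<html><body><h1>Search results</h1>\n";

my @hits = search_index($query, $type);   # would look the terms up in the index
if (@hits) {
    print "<ol>\n";
    print '<li>', format_hit($_), "</li>\n" for @hits;
    print "</ol>\n";
} else {
    print '<p>No pages matched "', escapeHTML($query), "\".</p>\n";
}
print "</body></html>\n";

sub search_index {                 # placeholder: real version reads the index file
    my ($q, $t) = @_;
    return ();                     # this stub returns no hits
}
sub format_hit { return $_[0] }    # placeholder: real version shows title, URL, size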

Figure 1.1  Architecture of a search engine [13]

Once the database has been searched, the results are returned to the user in a usable format. The format should include enough of a description of the records or documents returned to allow the user to decide which document he/she wishes to display.

1.4 System Evaluation

The most common measures of search system performance are time and space. In a system designed for data retrieval in particular, the response time and the space required are usually the metrics of most interest and importance for evaluating the system. In a system designed for information retrieval, the retrieved documents are not exact answers and have to be ranked according to their relevance to the query. Thus, besides time and space, recall and precision are also very important retrieval evaluation measures. Recall is measured as the ratio of the number of relevant documents retrieved to the total number of relevant items that exist in the collection, and precision is measured as the ratio of the number of relevant documents retrieved to the total number of documents retrieved. A desirable IR system is one that achieves high precision for most levels of recall (if not all).
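
As a simple illustration of these two measures (the numbers are invented for the example, not taken from the project's experiments): if a query has 8 relevant pages in the collection and the engine returns 10 hits of which 6 are relevant, then recall is 6/8 = 0.75 and precision is 6/10 = 0.6. In Perl:

# Precision and recall from raw counts (illustrative values only).
use strict;
use warnings;

my $relevant_retrieved = 6;   # relevant documents among the hits
my $total_retrieved    = 10;  # all documents returned by the engine
my $total_relevant     = 8;   # relevant documents in the whole collection

my $precision = $relevant_retrieved / $total_retrieved;
my $recall    = $relevant_retrieved / $total_relevant;
printf "precision = %.2f, recall = %.2f\n", $precision, $recall;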

In fact, search engines do not search the "entire" set of web pages in "real time"; they search databases which have been created from resources on the Internet. Therefore, the nature and content of the databases are relevant factors in evaluating the search engine as well. The best way to evaluate a search engine is to employ "real end-users or real-life queries". Besides collecting hard data from the original files, we sent an evaluation form (Appendix E) to real users and analysed the feedback. In order to assess the effects of the various techniques on information retrieval performance separately, the SCS search engine was designed to run under various conditions. We evaluated the search engine under these conditions and compared the results.

Chapter 2  Modelling Techniques

2.1 Introduction

In an information retrieval system, it is customary to represent each stored record and each information request by a set of content identifiers, or terms. The terms attached to an item may be assigned automatically or chosen manually; in either case, the terms used for a given item collectively represent its information content. An index term is a keyword (or group of related words) which has some meaning of its own (i.e., which usually has the semantics of a noun).

Retrieval based on index terms is simple but raises key questions regarding the information retrieval task. In fact, a lot of the semantics of a document or user request is lost when we replace its text with a set of words. Furthermore, matching between each document and the user request is attempted in this very imprecise space of index terms. Thus, the documents retrieved in response to a user request expressed as a set of keywords are frequently irrelevant. Predicting which documents are relevant and which are not, and to what degree, is a central problem of any modern IR system.

In a web search system, the target documents are hundreds of web pages. A ranking algorithm operates according to basic premises regarding the notion of document relevance. Distinct sets of premises (regarding web page relevance) yield distinct information retrieval models. Many models can be adopted to rank the documents, but the vector model is usually preferred due to its simplicity. Before beginning to design the searching system, we reviewed the currently popular modelling techniques and chose an appropriate one for building our web search engine.

There are three classic model types in the information retrieval field: the Boolean, the vector, and the probabilistic model. The classic models in information retrieval consider that each document is described by a set of representative keywords called index terms.
An index term is simply a document word whose semantics helps in recalling the document's main themes. In this project, these modelling techniques have been investigated for designing the SCS search engine.

Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. It provides a framework which is easy to grasp for a common user of an IR system. However, weights are not assigned to designate term importance. Instead, a term is either used to identify a given item or it is not: when assigned, the term may be assumed to carry a weight of 1; otherwise it carries a weight of 0. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic and data retrieval systems. The original template search engine adopts this type of model, which is very easy to design.

On the other hand, the Boolean model suffers from major drawbacks. First of all, its retrieval strategy is based on a binary decision criterion without any notion of a grading scale, which prevents good retrieval performance. For instance, although binary indexing may simplify input processing, the retrieval operations may be complicated by the fact that the documents retrieved in response to a given query are indistinguishable from each other. All retrieved items are treated as equally close to the query, because the number of terms assigned jointly to the query and the retrieved items is the same for all items. This leads to the retrieval of potentially large classes of items that are difficult for the system user to deal with.

Classical Vector Model

To overcome the above limitation of binary weights, the easiest way of introducing a distinction among classes of retrieved items is to use weighted instead of binary index terms to identify queries and documents. The vector model does this by assigning non-binary weights to index terms in queries and in documents (web pages in the SCS search engine). In such a situation it becomes possible to compute the degree of similarity between each document stored in the system and the user query.
The records can then be retrieved in ranked order. By sorting the retrieved web pages in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resulting effect is that the ranked document answer set is much more precise (in the sense that it better matches the user's information need) than the answer set retrieved by the Boolean model. In practice, a large variety of alternative ranking methods have been compared to the vector model, but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives [3]. Its additional advantages are that it is simple and fast. For these reasons, the vector model is the most popular retrieval model among researchers, practitioners, and the Web community.

Probabilistic Model

The probabilistic model attempts to capture the IR problem within a probabilistic framework. It is based on the following assumption: given a user query q and a document dj in the collection, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labelled R and should maximise the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query; documents not in this set are predicted to be non-relevant. In this model, the index term weight variables are all binary [3].

In theory, the main advantage of the probabilistic model is that documents are ranked in decreasing order of their probability of being relevant. However, its disadvantages are also obvious: it needs to guess an initial separation of documents into relevant and non-relevant sets; the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary); and it adopts an independence assumption for index terms.

In this project, we intend to use the vector model to build the system because this model is simple and yet powerful. The vector operations can be performed efficiently, so very large collections can be handled. Furthermore, it has been shown that its retrieval effectiveness is significantly higher than that of Boolean retrieval models. We therefore discuss this model further in the next section.

2.2 Vector and Vector Space Models

Term Weighting Systems

Term weighting methods are used to place different emphases on a term's (or a keyword's) relationship to the other terms and other documents in the collection. Currently there are several mathematical models being used to relate term precision weights to the frequency of occurrence of the terms in a given document collection and to the number of relevant documents a user wishes to retrieve in response to a query. In this section, we discuss the main ideas behind the most popular and effective term-weighting techniques.

Automatic indexing techniques are statistical. They are based on Luhn's hypothesis: the frequency of word occurrence in an article furnishes a useful measure of word significance. Normally, high-frequency terms tend to be too common in text generally to be of use, while low-frequency terms are considered unlikely to characterise the central information content of the document, so the idea is to measure the frequencies of words and apply two cut-offs, preserving only the mid-frequency words.

In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. The index terms can be weighted proportionally to their frequencies of occurrence. Let freqij be the frequency with which term ki occurs in dj, and define the normalised frequency fij to be:

    fij = freqij / max_l ( freqlj )

where the maximum is taken over all terms l = 1..t occurring in dj.

Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. In a general sense, a good index term should both describe the document well and distinguish that document from all others in the collection. This factor is usually referred to as the inverse document frequency. How well an index term distinguishes a document can be measured by the inverse frequency of occurrence of the term across all documents. Define the inverse document frequency idfi for ki as:

    idfi = log( N / ni )

where N is the number of documents in the collection and ni is the number of documents indexed by term ki. Combining the term frequency and the inverse document frequency, the best-known term-weighting schemes are given by:

    wij = fij * log( N / ni )

Finally, the weights can be normalised to ensure they lie between 0 and 1:

    wij = fij * log( N / ni ) / sqrt( sum over l=1..t of ( flj * log( N / nl ) )^2 )

Basic Vector Model

For the vector model, the weight wij associated with a pair (ki, dj) is positive and non-binary. Further, the index terms in the query are also weighted. Let wiq be the weight associated with the pair [ki, q], where wiq >= 0. Then the query vector q is defined as q = (w1q, w2q, ..., wtq), where t is the total number of index terms in the system. As above, the vector for a document dj is represented by dj = (w1j, w2j, ..., wtj).
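
A minimal Perl sketch of this weighting scheme follows. The in-memory data structures (a raw term-frequency hash for one page and a document-frequency hash for the collection) are assumptions chosen for illustration, not the actual index format used by the SCS scripts.

# Normalised tf-idf weights for one document (illustrative data structures).
use strict;
use warnings;
use List::Util qw(max sum);

# $freq:     hash ref, term => raw count of the term in this page   (freqij)
# $doc_freq: hash ref, term => number of pages containing the term  (ni)
# $n_docs:   total number of pages in the collection                (N)
sub tfidf_weights {
    my ($freq, $doc_freq, $n_docs) = @_;
    my $max_freq = max(values %$freq) || 1;
    my %w;
    for my $term (keys %$freq) {
        my $tf  = $freq->{$term} / $max_freq;          # fij
        my $idf = log($n_docs / $doc_freq->{$term});   # idfi = log(N/ni)
        $w{$term} = $tf * $idf;                        # wij before normalisation
    }
    my $norm = sqrt(sum(map { $_ ** 2 } values %w) || 1);
    $w{$_} /= $norm for keys %w;                       # keep weights in [0, 1]
    return \%w;
}

# Example call with made-up counts for one page.
my $w = tfidf_weights({ degree => 3, leeds => 1 }, { degree => 12, leeds => 40 }, 1000);
printf "%s %.3f\n", $_, $w->{$_} for sort keys %$w;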

In the vector model, then, a document dj and a user query q are both represented as t-dimensional vectors. The vector model proposes to evaluate the degree of similarity of the document dj with regard to the query q as the correlation between the vectors dj and q. This correlation can be quantified by the cosine of the angle between the two vectors:

    sim(dj, q) = dj . q / ( |dj| |q| )
               = sum over i=1..t of ( wij * wiq ) / ( sqrt( sum of wij^2 ) * sqrt( sum of wiq^2 ) )
               = cos(theta)

Fig 2.1  Vector space model: the similarity of a document D1 to a query Q is the cosine of the angle theta between their vectors (shown for the two-term case with axes T1 and T2; diagram not reproduced).

A vector space model is an alternative algebraic model which can be used to represent both terms and documents in a text collection. Contrary to the basic Boolean query model, the vector space model allows finding the documents which are most similar to the query without the need for a 100 percent match. In the vector space model, both queries and documents are represented as term vectors of the form Di = (di1, di2, ..., dit) and Q = (q1, q2, ..., qt). A document collection is then represented as a term-document matrix A (Fig 2.2, not reproduced here).

The similarity between a query vector Q and a document term vector Di can then be computed as the same cosine measure:

    sim(Q, Di) = sum over j=1..t of ( qj * dij ) / ( sqrt( sum of qj^2 ) * sqrt( sum of dij^2 ) )

This method of computing similarity coefficients between queries and documents is particularly advantageous because it allows one to sort all documents in decreasing order of similarity to a particular query. This also permits one to adapt the size of the retrieved document set to the user's needs.

Here, we take an example in a low-dimensional space.

Figure 2.3  A 9 x 7 term-by-document matrix built from seven SCS local web pages D1-D7 and nine index terms: T1 Information, T2 Leeds, T3 Research, T4 Vacancy, T5 Degree, T6 Multimedia, T7 Language, T8 Undergraduate, T9 Scholarship (matrix entries not reproduced).

Figure 2.3 demonstrates the simple idea of how a 9 x 7 term-by-document matrix is constructed from a small collection of SCS local web pages. The actual values assigned to the elements of the term-by-document matrix A = [aij] are usually weighted frequencies as opposed to the raw counts of term occurrences (within a document or across the entire collection).

The small collection of documents from Figure 2.3 can be used to illustrate simple query matching in a low-dimensional space. Since there are exactly 9 terms used to index the 7 documents, queries are represented as 9 x 1 vectors in the same way that each of the 7 pages is represented as a column of the 9 x 7 term-by-document matrix A. In order to retrieve the documents containing information about an undergraduate degree, the query vector has a 1 in the positions of the terms Degree (T5) and Undergraduate (T8) and 0 elsewhere, i.e. q = (0 0 0 0 1 0 0 1 0). Query matching in the vector space model can be viewed as a search in the column space of the matrix A for the documents most similar to the query. One of the most common similarity measures used for query matching is the cosine of the angle between the query vector and the document vectors (Fig 2.4 [18], not reproduced here).

In constructing a term-by-document matrix, terms are usually identified by their word stems. In the example shown in Figure 2.3, the words Degree and Degrees are counted as one term. Stemming reduces the number of rows in the term-by-document matrix A. The reduction of storage (via stemming) is certainly an important consideration for large collections of web documents.

Synonymy and polysemy are two further points which need the designer's attention. Synonymy refers to the use of synonyms or different words that have the same meaning, and polysemy refers to words that have different meanings when used in varying contexts.
Methods for handling the effects of synonymy and polysemy in the context of vector space models are not discussed further in this project because of time limitations. The generalised vector model introduces the further idea that document and query representations are translated directly into the term space. We adopted the vector model techniques and tried to improve the performance of the SCS search engine with them at design time.
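
To make the ranking computation concrete, here is a minimal Perl sketch of cosine similarity between a query vector and document vectors, each held as a hash of term weights. The data structures and the example weights are assumptions for illustration, not the SCS index format.

# Cosine similarity ranking over hashes of term weights (illustrative only).
use strict;
use warnings;
use List::Util qw(sum);

sub cosine {
    my ($q, $d) = @_;                      # hash refs: term => weight
    my $dot = sum(map { ($q->{$_} // 0) * $d->{$_} } keys %$d) // 0;
    my $nq  = sqrt(sum(map { $_ ** 2 } values %$q) || 1);
    my $nd  = sqrt(sum(map { $_ ** 2 } values %$d) || 1);
    return $dot / ($nq * $nd);
}

# Rank a small, made-up collection against the query "undergraduate degree".
my %query = (undergraduate => 1, degree => 1);
my %docs  = (
    'is.html'       => { degree => 0.8, undergraduate => 0.5, information => 0.3 },
    'research.html' => { research => 0.9, scholarship => 0.4 },
);
my @ranked = sort { cosine(\%query, $docs{$b}) <=> cosine(\%query, $docs{$a}) } keys %docs;
printf "%-14s %.3f\n", $_, cosine(\%query, $docs{$_}) for @ranked;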

Chapter 3  User Interface Design

The goal of research in user interface design for information retrieval is to give all people ready access to the information they desire. Information retrieval systems must not only provide efficient retrieval, but must also support the user in describing a problem that he or she does not understand well. Even where the search function is well designed, the vocabulary problem is an important factor affecting retrieval performance: users may know what they are looking for, but lack the knowledge needed to articulate the problem in the terms and abstractions used by the retrieval system.

More experienced users with a particular subject in mind may be able to specify a query directly, which results in a jump to a particular catalogue. From there, the user can refine his or her initial query by browsing onwards from that point. On the other hand, casual users without any prior knowledge of the contents of the system, or users without any particular subject in mind, may find it much more difficult to navigate freely. Good information retrieval system design therefore combines support for different information-seeking strategies, such as browsing and direct querying, in an interface that provides effective cues to the location, use, and characteristics of the retrieved information.

Much research is being conducted to attain better interface designs using such features as natural language, intelligent agents, and direct manipulation. All three of these techniques rely on the growing understanding of the way people think and how they acquire, store, and retrieve information. Up-to-date interface designs seek to solve the problems previously mentioned. The ultimate goal is to produce a design that is comprehensible, predictable, and controllable. Based on this, we will look at three areas of research in interface design: natural language processing, intelligent agents, and direct manipulation techniques.
Finally, how these design ideas can be integrated into a Web search engine will be discussed.

3.1 Natural Language

The ability simply to talk to a computer and have it fulfil a request has been a long-time goal of both experimenters in user interface design and artificial intelligence. The goal of natural language processing is to minimise the training required for users: the more naturally users can express their information needs in plain English, the smaller the burden upon them to learn the system. Natural language enhancements have improved commercially available search engines in a number of ways but still remain primitive compared with the original goals envisioned for them (Liddy, 1998). Some of these advances include automatic truncation. Some databases are able to recognise with greater reliability the plural and singular forms of a noun. Some are able to add or subtract portions of a word, especially suffixes. We can see some automation in the identification of proper nouns; however, recognition in these systems is limited to simply looking for words that begin with a capital letter. There are some search engines that recognise simple phrases, word variations, and concepts. These systems make use of "fuzzy matching", with the computer looking for similar words or phrases in close proximity to others (Thunderstone, 1998). This is most useful for compensating for errors in data entry and phonetics. For example, Yahoo suggests other possible spellings of a term if the user is unsure about the spelling, and Excite offers a list of words associated with the terms in the query to help users build a better set of keywords after a single search.

However, the use of natural language in user interface design has drawn much criticism from many sides. Philosophical objections abound about whether a computer can ever actually understand human language. Some studies have shown that often there is not much difference in performance between artificial and natural language systems (Ogden & Bernick, 1997).
Some experiments even showed that users were actually able to use the artificial language system faster and with more reliability once the system was learned. This raises the question of whether we need better natural language systems or better ways of teaching users.

3.2 Intelligent or Artificial Agents

Intelligent agents have attracted a great deal of attention over the last few years in the information retrieval area. The concept began with the notion that computers could become our personal secretaries, assistants, and reference librarians. An intelligent agent can autonomously carry out a task given by its user. It is a new way to reduce the time spent on routine personal tasks and so increase the time available for more gratifying activities (Roesler & Hawkins, 1994).

The vision of an intelligent agent proposed by interface designers in this area of work has altered over the last few years. In part, this change is due to the fact that computer technology is far removed from the initial vision of smart agents traversing the world's networks carrying out the bidding of their masters. This vision is still in the domain of the science fiction writer. Computer technology limits the current ability of even the simplest of agents. Even human-to-human communication of information needs is a difficult task because of the ambiguity of language. The information seeker often does not fully understand what he or she needs and how to express it. For computers to fully understand the ambiguous nature of a human information request is far, if ever, in the future.

3.3 Visualisation

Information visualisation is one of the most exciting areas of work in interface design for information retrieval. The basic idea is to transform data into graphic representations that will help users understand the information they have received from the machine (Pack, 1998). This concept takes the basics of the GUI and pushes the limits of design to include such things as 3D imaging, filter-flow modelling, and dynamic queries.

Much of the work being done in this area comes from many years of cognitive research and the belief that first-time users are frustrated by what they see on their displays (Shneiderman, 1997). They are using systems that provide them with few clues to the status of the system and how to use it. What they need is a system that reduces the cognitive burden of navigation. This is done through search fields, menus, direct-manipulation designs, and by following simple visual-coding rules that make these systems much easier to use.

There are several key techniques in this field. The first one is overview: users are given some sense of an overview of the entire collection. In traditional systems, users are returned information but have little knowledge of the scope of the entire collection; information visualisation systems allow the user to see, for example, a 3D hierarchical directory of the document set. Zooming is the second technique. This tool allows the user to move in and out of areas of interest within a collection. The user can zoom from an overview down to any area that looks promising. The zooming tool should work smoothly and allow the user to move back and forth quickly while preserving his or her sense of position in the collection. Filtering allows unwanted items to be thrown out, so that users can sort through large numbers of references. Dynamic queries use numeric sliders, buttons, and alpha sliders to push away unwanted items and to focus rapidly on items of interest. These queries are designed for quickness, reversibility, and immediate feedback. Once the number of items returned has been trimmed down, the information system should allow easy browsing of the details of the items. Moreover, the relate task in interface design provides visualisation of the relationships between items, whereas in traditional systems users sometimes have little information about the relationships between the items they are given. Another technique is history keeping: a good interface design should be able to undo, replay, and allow the user to refine his or her search. Information searching is a process, and the interface should help the user retrace his or her steps if the wrong path is followed.

Finally, a good interface is one that allows the extraction of sub-collections and of query sets. Methods are being sought to save information in a format that can be imported into other applications. Some systems already allow the user to drag-and-drop an item into an e-mail, a graphics page, or the next application window. However, there are still many technical difficulties that limit these design ideas; for instance, faster computers and networks are needed for the approaches discussed here to work and to allow rapid movement through data fields.

3.4 User Interface Design in WWW Search Engine

As we know, with the development of the GUI, the World Wide Web has already had a tremendous effect on user access to information. The Web browser is the most popular form of electronic information retrieval interface in use today. The interface design, with its few simple buttons (e.g. back, forward, stop), makes many things easier. Hyperlinks provide information on related topics and help users to move down through a site. However, there are drawbacks to the Web browser. For instance, users tend to forget where they are and how to return, because their paths are usually non-linear. Users also find it hard to remember what they previously looked at, and exploring every link is tedious. Because the number of web documents is always very large, the amount of retrieved information and its quality are also unpredictable.

The design of a user interface which permits gradual enlargement or refinement of the user's query by browsing through a graph of term and document subsets is a good way to improve the user's ability to control the amount of output obtained from a query. Geller and Lesk [1983], in a study comparing menus and specification of terms, suggested:

"Perhaps the most efficient way to do a subject search is to start with a keyword search to locate the correct category and then browse through the classification."

In line with the ideas from visualisation, a method for viewing the whole site structure through a viewing window is very useful. An overview can show the topic domains represented within the collections, helping users select or eliminate sources from consideration and helping them get started by directing them into general neighbourhoods, after which they can navigate using more detailed descriptions. The most popular type of overview is the display of large topical category hierarchies associated with the documents of a collection. Users can select a particular category and search for information on a smaller or larger scale. In some cases, a node of the category hierarchy can be associated directly with some relevant documents. However, it is difficult to design a good interface that integrates category selection into query specification. One reason is that displaying category hierarchies takes up a large amount of screen space.

The idea of natural language input is commonly used in practice: we let users express their information needs in free format, and provide various search types. Moreover, relevance feedback techniques are also crucial for supporting the iterative refinement of information needs in interface design. In a relevance feedback cycle, the user is presented with a list of retrieved documents and, after examining them, marks those which are relevant. Important terms, or expressions, attached to the documents identified as relevant are then selected, and the importance of these terms is enhanced in a new query formulation.

Chapter 4  System Design

4.1 Analysing the Current SCS Intranet Web Sites

The first task is to investigate the size of the SCS web site. There are about one thousand available web pages and documents on the web sites. They are hosted on a Unix server of the School of Computer Studies. The template search system will be installed on the Cslin Linux server.

4.2 The System Architecture

Fig 4.1  System architecture: the user interface communicates over the network with the searching modules and the indexing module, which read and write the index files (diagram not reproduced).

Because the size of the web site is not very big, it is not necessary to adopt a three-tier client-server architecture and use a special database server to store the index files. In this project, the index files are simply stored as text files on the same Cslin Linux server.

The web crawler can run on a computer separate from the one providing search services, through a separate interface. All indexing records are written to a self-contained file that can be transferred from a search workstation up to the search server.

4.3 Building the Index File and Ranking Hits

Fig 4.2  The keyword weighting system: web pages have fluff words removed (using the fluff-word list), the remaining words are stemmed, keywords are selected and weighted, and the index files are created or updated; a user query is then matched and ranked against the index to produce the search results (diagram not reproduced).

Keyword weighting system. As we discussed in Chapter 2, the keyword weighting system is used to give each keyword a weight according to its frequency and inverse document frequency.

Ranking system. In the vector model a document and a user query are represented as t-dimensional vectors, and their correlation can be quantified by the cosine of the angle between these two vectors.
When matching the query against the documents (web pages), the following formula is used to rank the retrieved web pages:

    sim(dj, q) = dj . q / ( |dj| |q| )

The retrieved web pages are sorted by these ranks, so that the most relevant document appears first. Search terms found in the title, keywords, or description are given additional weight.

Removing fluff words. Removing fluff words is the first step towards saving space and speeding up searches. To save space, a search engine should remove extremely common words such as "the", "web", "a" and "is". Extremely common words are not useful for finding the target web page, so they should not be put in the index file. The index file retains most of the relevant content of the original web page, and the extra space can be used to store more web pages. It is also clear that, in theory, it will take more time to find the target web page if the fluff words are kept in the index file. For example, for the string "the piano player", the search engine has to make three runs to find matches (again, this is oversimplified): first it looks for all matches of "the", then all matches of "piano", then all matches of "player". Chances are that looking for just the last two words is enough to find the relevant pages.

Stemming query and web page keywords. Sometimes a given topic is represented by different forms (e.g. plural or singular) of the same word. As well as plural and singular variation, we have the gerund form (making a verb out of a noun, as in index -> indexing) and different tenses (indexed) [13], as the sketch below illustrates.
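
The following Perl sketch illustrates these two preprocessing steps on a list of words. The fluff-word list is abbreviated, and stem_word() applies only a crude suffix rule as a stand-in for the full Porter algorithm (Appendix D); it is an illustration, not the SCS implementation.

# Fluff-word removal followed by (very simplified) stemming.
use strict;
use warnings;

my %fluff = map { $_ => 1 } qw(the a is and of to web);   # abbreviated list

sub stem_word {                       # crude stand-in for the Porter stemmer
    my ($w) = @_;
    $w = lc $w;
    $w =~ s/(?:ing|ed|es|s)$// if length($w) > 4;   # strip one common suffix
    return $w;
}

my @words = qw(Indexing the Piano Players of Leeds);
my @keep  = grep { !$fluff{lc $_} } @words;         # drop fluff words
my @stems = map  { stem_word($_) } @keep;           # reduce to stems
print "@stems\n";                                   # prints "index piano player leed"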

Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In the web IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings. Thus, stemming reduces the number of terms in the index file. The reduction in storage (via stemming) is certainly an important consideration for large collections of web documents. Moreover, it can improve the recall rate of information retrieval. Therefore, a good stemming algorithm must be used in the search engine.

4.4 User Interface Design

According to the investigation in Chapter 3, the ideal user interface for the search engine should support free-format query input. For this purpose, the search form provides various search types, and users can choose whichever is most convenient from the interface form.

At the early stages of the project, I hoped the SCS search engine could support a catalogue overview as well, so that users could view the whole structure of the local site, find their target directory quickly, and reduce the scale of the search. User relevance feedback was also taken into consideration: in a feedback cycle, a new query would be used to modify the original query, and a set of returned documents would replace the original document collection, reducing the scale of the search. However, the SCS web site is not a big web site, so it is not really necessary or useful to implement relevance feedback in the SCS search engine.

Hierarchical category overview

Fig 4.3  Sketch of the planned hierarchical category overview (entries include new_student_information.html, /MSc, is.html, dms_msc.html, /staff, /research and /campus; tree diagram not reproduced).

Through the overview tree, users can click on the area they are interested in to reduce the scale of the search. This tool allows the user to move in and out of areas of interest within a collection. The user can zoom from an overview down to any area that looks promising, and can quickly move back and forth while preserving his or her sense of position in the collection.

Unfortunately, it is difficult to design a good interface that integrates category selection into query specification, because displaying category hierarchies takes up a large amount of screen space. Another difficulty is how to classify the hundreds of web pages correctly, and how to let the web administrators add new web pages to the right category when they update the web site.
In many cases, the same web page can belong to several categories. When the design was put into implementation, these difficulties could not be overcome in a short time, and the category overview was finally dropped from the project implementation.

Three search types

Three search types are provided to users, giving the engine the ability to handle simple, complex and string queries:

All Terms Searching - every term in the search string must be matched; for example, a search for "Information retrieval" will return primarily web pages containing both "Information" and "retrieval".

Any Term Searching - the search string is matched word by word; for example, a search for "Information retrieval" will return documents containing either "Information" or "retrieval".

Phrase Searching - all terms must occur next to each other and in the original order.

4.5 Administrator Interface Design

The web site administrator can index or update web pages with a web crawler through a web interface. To access the administration page the user needs a password, in order to protect the database. The administrator can also reset the password through the web browser.

Chapter 5  Implementation

5.1 Choosing a Programming Language

Perl is a scripting language and is not compiled to an object binary like C or Pascal. It has its own syntax and libraries, and communicates with web servers using the CGI standard. It can be used on most platforms and with most web servers. Moreover, Perl's text-processing functions are very powerful, and the language is very easy to learn and use. For these reasons, Perl was chosen as the programming language for building the search engine.

5.2 Finding an Appropriate Template Search Tool

Due to the limited time available, some free template search tools were used in this project for developing the SCS search engine. Price, platform, capacity, ease of installation, and maintenance were the factors taken into consideration when selecting the template search tool. After comparing several search tools, the Fluid Dynamics Search Engine was chosen as the original template for the SCS search engine. It is programmed in Perl, and the script runs well on Unix and Windows platforms or any other Perl CGI environment. The script retrieves the pages over the web site, and a web page's text, keywords, description, title, and address are all extracted and used for searching. It then saves these contents in a separate text file (index file I). During the search process, the script searches the index file to find all links related to the user's query. The retrieved web pages are sorted by the number of keyword hits, and the title, description, size, last modified time, and address of each document are shown to the user in the list of hits.
The administrator can configure the number of hits to show per page.

5.3 Adding a Stemming Algorithm

The template search engine does not perform term stemming. In the SCS search engine, Porter's stemming algorithm [15] was used to implement this function. The code was obtained from the web; after some alteration, it was incorporated into the SCS search engine as a function.

5.4 Removing More Fluff Words

The template scripts contain a list of fluff words; however, the SCS search engine should remove more fluff words specific to this collection. The additional fluff words come from the MSc IS14 module handouts [19].

5.5 Improving the User Interface

The template engine lets the user search with two types, "any term" and "all terms". In the SCS search engine, the user interface was altered and a phrase search type was added to the search form, and the search script was edited to support phrase search. The search terms are stored in the index file in the same order as in the original web pages; when the user chooses this search type, the query terms must match the whole phrase in the right order, as illustrated in the sketch below.
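
A minimal Perl sketch of the three matching modes follows. For clarity it matches against raw page text, and the variable names are hypothetical; the real scripts work against the index file rather than the page text itself.

# All-terms, any-term and phrase matching against a page's text (illustrative).
use strict;
use warnings;

sub page_matches {
    my ($text, $query, $type) = @_;
    my @terms = split /\s+/, lc $query;
    my $lc    = lc $text;

    if ($type eq 'phrase') {                 # terms adjacent and in order
        my $phrase = join ' ', @terms;
        return $lc =~ /\b\Q$phrase\E\b/;
    }
    my $hits = grep { $lc =~ /\b\Q$_\E\b/ } @terms;
    return $type eq 'all' ? $hits == @terms  # every term present
                          : $hits > 0;       # at least one term present
}

my $page = "MSc Information Systems modules cover information retrieval.";
print page_matches($page, "information retrieval", $_) ? "$_: match\n" : "$_: no match\n"
    for qw(all any phrase);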

Fig 5.1  Search form (screenshot not reproduced)

Fig 5.3  Results list (screenshot not reproduced)

For the reasons mentioned in Chapter 4, the implementation of the category overview was abandoned.

5.6 Using the Vector Model

The binary model is used in the original template search engine. In the final SCS search engine, a vector model was added. In order to compare the effect of the two models on the performance of the search engine, the binary model was kept in the SCS search engine as well, and the user can choose either of them from the search form.

The term weighting system was programmed in Perl. Following the theory in Chapter 2, each term is given a weight according to its frequency and inverse document frequency at indexing time. A separate Perl script was written to implement this task.

Fig 5.2  The weighting program reads index file I (keyed by web page number) and writes the term weights of each web page to index file II, together with a position file recording, for each web page key number, the position of that page's entry (diagram not reproduced).

The term weights are stored in a separate text file (index file II). In order to save CPU time during real-time searching, a position file is created to keep the position of each web page's entry in index file II. Through the position file, the search engine can obtain the term weights of any document (web page) in index file II directly.

Term stemming affects the weight results in some ways. For example, through stemming, the four terms computer, compute, computing and computability are all replaced by comput.
Thus the frequencies of the four terms are summed up into the single term comput. In this way, web pages containing any of these terms have a better chance of being matched against a relevant topic about computers.

Retrieved documents are sorted by ranking score. The scores depend on the query term weights and the document term weights. In order to make the query look more natural to the user, we do not place any format restrictions on the input: users can include symbols in their query such as asterisks, quotes, and brackets, and these will not affect the search results. After fluff words are removed, the query terms are treated equally, and the weights of the document terms are used to compute the ranks of the retrieved web pages. The ranking function is also programmed in Perl and incorporated into the SCS search engine, using the position file lookup sketched below.
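
To show how the position file avoids scanning the whole weight file at query time, here is a minimal Perl sketch. The file names, the record layout (one line per page of tab-separated term=weight pairs) and the byte-offset scheme are assumptions for illustration, not the exact formats used by the SCS scripts.

# Look up one page's term weights via a byte-offset position file (illustrative).
use strict;
use warnings;

# position.txt lines:  <page_key>\t<byte offset of that page's line in index2.txt>
# index2.txt lines:    <page_key>\t<term=weight>\t<term=weight>...

sub load_positions {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    my %pos;
    while (<$fh>) {
        chomp;
        my ($key, $offset) = split /\t/;
        $pos{$key} = $offset;
    }
    return \%pos;
}

sub weights_for_page {
    my ($index_path, $pos, $key) = @_;
    open my $fh, '<', $index_path or die "cannot open $index_path: $!";
    seek $fh, $pos->{$key}, 0;              # jump straight to the page's record
    my $line = <$fh>;
    chomp $line;
    my (undef, @pairs) = split /\t/, $line; # drop the leading page key
    return { map { split /=/, $_, 2 } @pairs };
}

my $pos = load_positions('position.txt');
my $w   = weights_for_page('index2.txt', $pos, 'is.html');
printf "%s %.3f\n", $_, $w->{$_} for sort keys %$w;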

Chapter 6  Testing, Evaluation and Conclusions

6.1 Testing Environment

The SCS search engine was tested under the following environmental conditions; the configurations used were those available to the author.

Server software:
- Windows 98 with Perl 5 (running locally)
- Linux server (the Cslin server described in Chapter 4)

Browser configurations:
- Windows 98, 56k modem link to the remote server and using the local server, with Internet Explorer 4.0
- Windows NT 4.0, networked, with Internet Explorer 5.0 and Navigator

6.2 Evaluation Criteria

First of all, as discussed in the first chapter, recall and precision are two of the most important measures for evaluating an information retrieval system. Recall, the proportion of relevant documents retrieved, is not considered a viable measure for Internet search engines, because it is impossible to determine how many relevant items there are for a particular query. However, recall may be a viable measure for intranet search engines, because in the relatively confined intranet environment it may be possible to identify with reasonable surety the documents relevant to a particular query.


More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Interaction Style Categories. COSC 3461 User Interfaces. What is a Command-line Interface? Command-line Interfaces

Interaction Style Categories. COSC 3461 User Interfaces. What is a Command-line Interface? Command-line Interfaces COSC User Interfaces Module 2 Interaction Styles What is a Command-line Interface? An interface where the user types commands in direct response to a prompt Examples Operating systems MS-DOS Unix Applications

More information

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved.

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. User Guide Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. Central Search User Guide Table of Contents Welcome to Central Search... 3 Starting Your Search... 4 Basic Search & Advanced

More information

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17 Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision

More information

Contents 1. INTRODUCTION... 3

Contents 1. INTRODUCTION... 3 Contents 1. INTRODUCTION... 3 2. WHAT IS INFORMATION RETRIEVAL?... 4 2.1 FIRST: A DEFINITION... 4 2.1 HISTORY... 4 2.3 THE RISE OF COMPUTER TECHNOLOGY... 4 2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL...

More information

Part 11: Collaborative Filtering. Francesco Ricci

Part 11: Collaborative Filtering. Francesco Ricci Part : Collaborative Filtering Francesco Ricci Content An example of a Collaborative Filtering system: MovieLens The collaborative filtering method n Similarity of users n Methods for building the rating

More information

Automatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey

Automatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey I. Automatic Document; Retrieval Systems The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey decimal aystemf Library of Congress system). Cross-indexing

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

This session will provide an overview of the research resources and strategies that can be used when conducting business research.

This session will provide an overview of the research resources and strategies that can be used when conducting business research. Welcome! This session will provide an overview of the research resources and strategies that can be used when conducting business research. Many of these research tips will also be applicable to courses

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Taxonomies and controlled vocabularies best practices for metadata

Taxonomies and controlled vocabularies best practices for metadata Original Article Taxonomies and controlled vocabularies best practices for metadata Heather Hedden is the taxonomy manager at First Wind Energy LLC. Previously, she was a taxonomy consultant with Earley

More information

Cognitive Walkthrough Evaluation

Cognitive Walkthrough Evaluation Columbia University Libraries / Information Services Digital Library Collections (Beta) Cognitive Walkthrough Evaluation by Michael Benowitz Pratt Institute, School of Library and Information Science Executive

More information

Automated Cognitive Walkthrough for the Web (AutoCWW)

Automated Cognitive Walkthrough for the Web (AutoCWW) CHI 2002 Workshop: Automatically Evaluating the Usability of Web Sites Workshop Date: April 21-22, 2002 Automated Cognitive Walkthrough for the Web (AutoCWW) Position Paper by Marilyn Hughes Blackmon Marilyn

More information

Information Retrieval CSCI

Information Retrieval CSCI Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1

More information

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz The Cranfield Paradigm Retrieval Performance Evaluation Evaluation Using

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Searching the Evidence using EBSCOHost

Searching the Evidence using EBSCOHost CAMBRIDGE UNIVERSITY LIBRARY MEDICAL LIBRARY Supporting Literature Searching Searching the Evidence using EBSCOHost ATHENS CINAHL Use to search CINAHL with an NHS ATHENS login (or PsycINFO with University

More information

Automated Item Banking and Test Development Model used at the SSAC.

Automated Item Banking and Test Development Model used at the SSAC. Automated Item Banking and Test Development Model used at the SSAC. Tural Mustafayev The State Student Admission Commission of the Azerbaijan Republic Item Bank Department Item Banking For many years tests

More information

COCHRANE LIBRARY. Contents

COCHRANE LIBRARY. Contents COCHRANE LIBRARY Contents Introduction... 2 Learning outcomes... 2 About this workbook... 2 1. Getting Started... 3 a. Finding the Cochrane Library... 3 b. Understanding the databases in the Cochrane Library...

More information

Document Clustering for Mediated Information Access The WebCluster Project

Document Clustering for Mediated Information Access The WebCluster Project Document Clustering for Mediated Information Access The WebCluster Project School of Communication, Information and Library Sciences Rutgers University The original WebCluster project was conducted at

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

Using Scopus. Scopus. To access Scopus, go to the Article Databases tab on the library home page and browse by title.

Using Scopus. Scopus. To access Scopus, go to the Article Databases tab on the library home page and browse by title. Using Scopus Databases are the heart of academic research. We would all be lost without them. Google is a database, and it receives almost 6 billion searches every day. Believe it or not, however, there

More information

Search Engine Optimization (SEO) using HTML Meta-Tags

Search Engine Optimization (SEO) using HTML Meta-Tags 2018 IJSRST Volume 4 Issue 9 Print ISSN : 2395-6011 Online ISSN : 2395-602X Themed Section: Science and Technology Search Engine Optimization (SEO) using HTML Meta-Tags Dr. Birajkumar V. Patel, Dr. Raina

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

User Interfaces Assignment 3: Heuristic Re-Design of Craigslist (English) Completed by Group 5 November 10, 2015 Phase 1: Analysis of Usability Issues Homepage Error 1: Overall the page is overwhelming

More information

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

CPS122 Lecture: From Python to Java last revised January 4, Objectives:

CPS122 Lecture: From Python to Java last revised January 4, Objectives: Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2)

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2) SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay Lecture #10 Process Modelling DFD, Function Decomp (Part 2) Let us continue with the data modeling topic. So far we have seen

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,

More information

Information Retrieval. Chap 7. Text Operations

Information Retrieval. Chap 7. Text Operations Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing

More information

Reference By Any Other Name

Reference By Any Other Name Boise State University ScholarWorks Library Faculty Publications and Presentations The Albertsons Library 4-1-2010 Reference By Any Other Name Ellie Dworak Boise State University This document was originally

More information

Active Server Pages Architecture

Active Server Pages Architecture Active Server Pages Architecture Li Yi South Bank University Contents 1. Introduction... 2 1.1 Host-based databases... 2 1.2 Client/server databases... 2 1.3 Web databases... 3 2. Active Server Pages...

More information

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Elementary IR: Scalable Boolean Text Search. (Compare with R & G ) Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context

More information

Skill Area 209: Use Internet Technology. Software Application (SWA)

Skill Area 209: Use Internet Technology. Software Application (SWA) Skill Area 209: Use Internet Technology Software Application (SWA) Skill Area 209.1 Use Browser for Research (10hrs) 209.1.1 Familiarise with the Environment of Selected Browser Internet Technology The

More information

Provided by TryEngineering.org -

Provided by TryEngineering.org - Provided by TryEngineering.org - Lesson Focus Lesson focuses on exploring how the development of search engines has revolutionized Internet. Students work in teams to understand the technology behind search

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Information Retrieval and Data Mining Part 1 Information Retrieval

Information Retrieval and Data Mining Part 1 Information Retrieval Information Retrieval and Data Mining Part 1 Information Retrieval 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Information Retrieval - 1 1 Today's Question 1. Information

More information

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

CABI Training Materials. Ovid Silver Platter (SP) platform. Simple Searching of CAB Abstracts and Global Health KNOWLEDGE FOR LIFE.

CABI Training Materials. Ovid Silver Platter (SP) platform. Simple Searching of CAB Abstracts and Global Health KNOWLEDGE FOR LIFE. CABI Training Materials Ovid Silver Platter (SP) platform Simple Searching of CAB Abstracts and Global Health www.cabi.org KNOWLEDGE FOR LIFE Contents The OvidSP Database Selection Screen... 3 The Ovid

More information

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about:

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about: WWW and Web Browser 6.0 Introduction WWW stands for World Wide Web. WWW is a collection of interlinked hypertext pages on the Internet. Hypertext is text that references some other information that can

More information

Due on: May 12, Team Members: Arpan Bhattacharya. Collin Breslin. Thkeya Smith. INFO (Spring 2013): Human-Computer Interaction

Due on: May 12, Team Members: Arpan Bhattacharya. Collin Breslin. Thkeya Smith. INFO (Spring 2013): Human-Computer Interaction Week 6 Assignment: Heuristic Evaluation of Due on: May 12 2013 Team Members: Arpan Bhattacharya Collin Breslin Thkeya Smith INFO 608-902 (Spring 2013): Human-Computer Interaction Group 1 HE Process Overview

More information

Vision Document for Multi-Agent Research Tool (MART)

Vision Document for Multi-Agent Research Tool (MART) Vision Document for Multi-Agent Research Tool (MART) Version 2.0 Submitted in partial fulfillment of the requirements for the degree MSE Madhukar Kumar CIS 895 MSE Project Kansas State University 1 1.

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

USER SEARCH INTERFACES. Design and Application

USER SEARCH INTERFACES. Design and Application USER SEARCH INTERFACES Design and Application KEEP IT SIMPLE Search is a means towards some other end, rather than a goal in itself. Search is a mentally intensive task. Task Example: You have a friend

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

evision Review Project - Engagement Simon McLean, Head of Web & IT Support Information & Data Services.

evision Review Project - Engagement Simon McLean, Head of Web & IT Support Information & Data Services. evision Review Project - Engagement Monitoring Simon McLean, Head of Web & IT Support Information & Data Services. What is Usability? Why Bother? Types of usability testing Usability Testing in evision

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR models: Boolean model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Browsing boolean vector probabilistic

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Creating a Course Web Site

Creating a Course Web Site Creating a Course Web Site What you will do: Use Web templates Use shared borders for navigation Apply themes As an educator or administrator, you are always looking for new and exciting ways to communicate

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Web-interface for Monte-Carlo event generators

Web-interface for Monte-Carlo event generators Web-interface for Monte-Carlo event generators Jonathan Blender Applied and Engineering Physics, Cornell University, Under Professor K. Matchev and Doctoral Candidate R.C. Group Sponsored by the University

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model Peter Brusilovsky http://www2.sis.pitt.edu/~peterb/2140-051/ Final Group Projects Groups of variable

More information

Predictive Coding. A Low Nerd Factor Overview. kpmg.ch/forensic

Predictive Coding. A Low Nerd Factor Overview. kpmg.ch/forensic Predictive Coding A Low Nerd Factor Overview kpmg.ch/forensic Background and Utility Predictive coding is a word we hear more and more often in the field of E-Discovery. The technology is said to increase

More information

DRACULA. CSM Turner Connor Taylor, Trevor Worth June 18th, 2015

DRACULA. CSM Turner Connor Taylor, Trevor Worth June 18th, 2015 DRACULA CSM Turner Connor Taylor, Trevor Worth June 18th, 2015 Acknowledgments Support for this work was provided by the National Science Foundation Award No. CMMI-1304383 and CMMI-1234859. Any opinions,

More information

CSI Lab 02. Tuesday, January 21st

CSI Lab 02. Tuesday, January 21st CSI Lab 02 Tuesday, January 21st Objectives: Explore some basic functionality of python Introduction Last week we talked about the fact that a computer is, among other things, a tool to perform high speed

More information