
A SEARCH ENGINE FOR THE SCHOOL OF COMPUTING WEB SITE

Jian Ye Deng
MSc Information Systems
School of Computer Studies
University of Leeds

The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others.

Abstract

This project set out to discuss and review local web site search engines and information retrieval (IR) techniques. A web search engine for the SCS intranet web site was built to support the research.

The primary objectives of the project were as follows:
1. Study and explore various information retrieval techniques for improving effectiveness (optimising precision and recall).
2. Build a web search engine for searching the School of Computer Studies web pages, and apply IR techniques in the search engine.
3. Test and evaluate the searching system.

To assess the impact of the IR techniques as they were incorporated into the implementation, I reviewed and explored the major information retrieval technologies and built a web search engine that adopts various approaches drawn from these techniques. The impact of these techniques was then evaluated through a series of experiments.

The personal objectives of the project were:
1. Extend my theoretical and practical knowledge in the information retrieval field.
2. Gain experience of CGI programming.
3. Produce a fully working product.

The deliverables of this project were:
1. A working SCS web search engine.
2. This project report.

All of the objectives and deliverables of the project have been completed. This project gave me a better understanding of the organisation and time scheduling of projects, as well as of the specific problems which can occur in them.

Acknowledgements

I would like to thank my project supervisor, Dr Stuart Roberts, for his very useful advice and support throughout the project. Moreover, his wife, Mrs Ann Roberts, and their lovely sons made helpful contributions to the evaluation process. Credit should also be given to Dr Nick Efford and Dr Peter Jimack, who gave me useful advice as well. Thanks also go to everyone who has helped during this project. Last but not least, special thanks go to my wife for being supportive throughout this project.

Contents

Abstract
Acknowledgements
Contents

Chapter 1  Introduction
  1.1  Background
  1.2  Local Site Search Components
  1.3  Structure of a Local Site Search Engine
  1.4  System Evaluation

Chapter 2  Modelling Techniques
  2.1  Introduction
       Boolean Model
       Classical Vector Model
       Probabilistic Model
  2.2  Vector and Vector Space Models
       Term Weighting Systems
       Basic Vector Model
       Vector Space Models

Chapter 3  User Interface Design Techniques
  3.1  Natural Language
  3.2  Intelligent or Artificial Agents
  3.3  Visualisation
  3.4  User Interface Design in WWW Search Engine

Chapter 4  System Design
  4.1  Analysing the Current SCS Intranet Web Sites
  4.2  The System Architecture
  4.3  Building the Index File and Ranking Hits
  4.4  User Interface Design
  4.5  Administrator Interface Design

Chapter 5  System Implementation
  5.1  Choosing a Programming Language
  5.2  Finding an Appropriate Template Search Tool
  5.3  Adding a Stemming Algorithm
  5.4  Removing More Fluff Words
  5.5  Improving the User Interface
  5.6  Using the Vector Model

Chapter 6  System Testing, Evaluation and Conclusion
  6.1  Testing Environment
  6.2  Evaluation Criteria
  6.3  Evaluation Process
  6.4  Evaluation Results
  6.5  Conclusions and Future Thoughts

Appendix A
Appendix B
Appendix C
Appendix D  Porter Stemming Algorithm
Appendix E
References

Chapter 1  Introduction

1.1 Background

The World Wide Web is a large, distributed digital information space. It started as an organisation-wide collaborative environment for sharing research documents in nuclear physics at CERN. Nowadays, the web is becoming a universal repository of human knowledge and culture, encompassing diverse information resources including personal web pages, online digital libraries, virtual museums, product and service catalogues, government information for public dissemination, research publications, and Gopher, FTP, Usenet news, and mail servers.

However, finding useful information on the web is frequently a tedious and difficult task. To satisfy an information need, users might navigate the space of web links in search of the expected information. There is a paradox: the more information a site has, the more useful it is, and the harder it is to navigate! Since the whole web space is vast and largely unknown, the navigation task is usually inefficient. For naive users the problem becomes much more difficult, and it might entirely frustrate all their efforts.

As the WWW started to explode in terms of users, servers, and pages, it became obvious that search capabilities had to be added, and a flourishing market of public search engines emerged. The first web-based search engines came into existence in the early 1990s. Regular users of the Internet's World Wide Web are very familiar with the sites that have been developed to allow users to search across all or part of the Web. Examples of these sites include Infoseek, Altavista, Hotbot, etc. They are excellent tools for finding information that is stored across the Web and can assist users in finding information that would otherwise be difficult to locate. Later, the growing number of intranets, i.e. intra-organisational webs (such as the SCS intranet web site) hidden from the Internet behind firewalls or proxies, created a need not only for public web search services but also for internal web search products.

IR is currently maturing, bringing together components from many constituent research communities, each with their own traditions and characteristics. These constituents include mathematical modelling of information using logical and probabilistic approaches, and modelling of the information-seeking process of searchers. These approaches are being added to a strong experimental and hypothesis-testing tradition within IR research, which itself is being augmented by more psychological-style experiments introduced to the computing science community via human-computer interaction and cognitive science research. Nowadays, research in IR includes modelling, document classification and categorisation, system architecture, user interfaces, data visualisation, filtering, languages, etc. Building a local web site search engine involves applying one or more of the above techniques to improve the searching performance.

The objectives of the project are to explore various information retrieval techniques for improving effectiveness and to build a local site web search engine (the SCS web search engine) for searching the Computer Studies local web pages.

1.2 Local Site Search Components

A local site information retrieval system can be divided into several relatively separate components, which can be built and maintained individually. Taking the SCS search engine as an example, it includes the following parts.

The Search Engine is the key component of the system: the program (CGI, server module or separate server) that accepts the request from the form or URL, searches the index, and returns the results page to the server. In our search engine, the CGI programs are written in Perl, an easy and powerful programming language. Perl scripts can be used on most platforms and communicate with most web servers using the CGI standard. A site visitor filling in data and clicking a Search button on an HTML form invokes the site search CGIs.
They take the data from the form as parameters, search for the terms, limit the results according to any other settings, and return the result list as an HTML page. However, there is some overhead in sending the data back and forth, and in some cases the CGI programs can become overwhelmed.

The Search Indexer program creates the Search Index File. This file stores the data from the web site in a special index or database designed for very quick access. Depending on the indexing algorithm and the size of the site, this file can become very large. In the SCS search engine, several text files play this role. These files can be updated through a friendly interface in order to keep them synchronised with the pages and avoid providing obsolete results. On the other hand, creating or updating the index file is a time-consuming task.

Providing an HTML interface - the Search Forms - for visitors to enter their search terms and specify their preferences is a necessary part of any web search engine.

Finally, how to list the web pages which contain text matching the search term(s) is the fourth part that needs to be taken into consideration when designing the search engine. The retrieved hits are sorted in some kind of relevance order, usually based on the number of times the search terms appear according to the particular ranking algorithm and on whether they appear in a title or header. Most results listings include the title of the page and a summary (the Meta Description data, the first few lines of the page, or the most important text). In the SCS search engine, the listing also includes the date modified, file size, and URL.

1.3 Structure of a Local Site Search Engine

Like most web search engines, the SCS search engine is the application which searches the data and returns the results to the client. This means creating an HTML page in the specified format. It searches within an index created by an Indexer application. The users enter their search terms in a text field, and may select appropriate settings in the form. When they submit their queries, the server passes that data to the search engine application.
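
To make this flow concrete, the following is a minimal Perl CGI sketch of such a front end. It is an illustration only, not the actual SCS scripts: the form field names ('terms' and 'type') and the subroutines search_index() and format_hit() are hypothetical placeholders.

#!/usr/bin/perl
# Minimal CGI sketch of a search front end (hypothetical parameter and
# subroutine names, not the actual SCS scripts).
use strict;
use warnings;
use CGI qw(param header escapeHTML);

my $query = param('terms') || '';    # search terms typed by the visitor
my $type  = param('type')  || 'all'; # e.g. all / any / phrase

print header(-type => 'text/html');
print "<html><body><h1>Search results</h1>\n";

my @hits = search_index($query, $type);   # would look the terms up in the index
if (@hits) {
    print "<ol>\n";
    print '<li>', format_hit($_), "</li>\n" for @hits;
    print "</ol>\n";
} else {
    print '<p>No pages matched "', escapeHTML($query), "\".</p>\n";
}
print "</body></html>\n";

sub search_index {                 # placeholder: real version reads the index file
    my ($q, $t) = @_;
    return ();                     # this stub returns no hits
}
sub format_hit { return $_[0] }    # placeholder: real version shows title, URL, size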

Figure 1.1  Architecture of a search engine [13]

Once the database has been searched, the results are returned to the user in a usable format. The format should include enough of a description of the records or documents returned to allow the user to decide which document he/she wishes to display.

1.4 System Evaluation

The most common measures of search system performance are time and space. In a system designed for data retrieval in particular, the response time and the space required are usually the metrics of most interest and importance for evaluating the system. In a system designed for information retrieval, the retrieved documents are not exact answers and have to be ranked according to their relevance to the query. Thus, besides time and space, recall and precision are also very important retrieval evaluation measures. Recall is measured as the ratio of the number of relevant documents retrieved to the total number of relevant items that exist in the collection, and precision is measured as the ratio of the number of relevant documents retrieved to the total number of documents retrieved. A desirable IR system is one that achieves high precision for most levels of recall (if not all).
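
As a simple illustration of these two measures (the numbers are invented for the example, not taken from the project's experiments): if a query has 8 relevant pages in the collection and the engine returns 10 hits of which 6 are relevant, then recall is 6/8 = 0.75 and precision is 6/10 = 0.6. In Perl:

# Precision and recall from raw counts (illustrative values only).
use strict;
use warnings;

my $relevant_retrieved = 6;   # relevant documents among the hits
my $total_retrieved    = 10;  # all documents returned by the engine
my $total_relevant     = 8;   # relevant documents in the whole collection

my $precision = $relevant_retrieved / $total_retrieved;
my $recall    = $relevant_retrieved / $total_relevant;
printf "precision = %.2f, recall = %.2f\n", $precision, $recall;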

In fact, search engines do not search the "entire" set of web pages in "real time"; they search databases which have been created from resources on the Internet. Therefore, the nature and content of the databases are relevant factors in evaluating the search engine as well. The best way to evaluate a search engine is to employ "real end-users or real-life queries". Besides collecting hard data from the original files, we sent an evaluation form (Appendix E) to real users and analysed the feedback. In order to assess the effects of the various techniques on information retrieval performance separately, the SCS search engine was designed to run under various conditions. We evaluated the search engine under these conditions and compared the results.

Chapter 2  Modelling Techniques

2.1 Introduction

In an information retrieval system, it is customary to represent each stored record and each information request by a set of content identifiers, or terms. The terms attached to an item may be assigned automatically or chosen manually; in either case, the terms used for a given item collectively represent its information content. An index term is a keyword (or group of related words) which has some meaning of its own (i.e., which usually has the semantics of a noun).

Retrieval based on index terms is simple but raises key questions regarding the information retrieval task. In fact, a lot of the semantics of a document or user request is lost when we replace its text with a set of words. Furthermore, matching between each document and the user request is attempted in this very imprecise space of index terms. Thus, the documents retrieved in response to a user request expressed as a set of keywords are frequently irrelevant. Predicting which documents are relevant and which are not, and to what degree, is a central problem of any modern IR system.

In a web search system, the target documents are hundreds of web pages. A ranking algorithm operates according to basic premises regarding the notion of document relevance. Distinct sets of premises (regarding web page relevance) yield distinct information retrieval models. Many models can be adopted to rank the documents, but the vector model is usually preferred due to its simplicity. Before beginning to design the searching system, we reviewed the currently popular modelling techniques and chose an appropriate one for building our web search engine.

There are three classic model types in the information retrieval field: the Boolean, the vector, and the probabilistic model. The classic models in information retrieval consider that each document is described by a set of representative keywords called index terms.
An index term is simply a document word whose semantics helps in recalling the document's main themes. In this project, these modelling techniques have been investigated for designing the SCS search engine.

Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. It provides a framework which is easy to grasp for a common user of an IR system. However, weights are not assigned to designate term importance. Instead, a term is either used to identify a given item or it is not: when assigned, the term may be assumed to carry a weight of 1; otherwise it carries a weight of 0. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic and data retrieval systems. The original template search engine adopts this type of model, which is very easy to design.

On the other hand, the Boolean model suffers from major drawbacks. First of all, its retrieval strategy is based on a binary decision criterion without any notion of a grading scale, which prevents good retrieval performance. For instance, although binary indexing may simplify input processing, the retrieval operations may be complicated by the fact that the documents retrieved in response to a given query are indistinguishable from each other. All retrieved items are treated as equally close to the query, because the number of terms assigned jointly to the query and the retrieved items is the same for all items. This leads to the retrieval of potentially large classes of items that are difficult for the system user to deal with.

Classical Vector Model

To overcome the above limitation of binary weights, the easiest way of introducing a distinction among classes of retrieved items is to use weighted instead of binary index terms to identify queries and documents. The vector model does this by assigning non-binary weights to index terms in queries and in documents (web pages in the SCS search engine). In such a situation it becomes possible to compute the degree of similarity between each document stored in the system and the user query.
The records can then be retrieved in ranked order. By sorting the retrieved web pages in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resulting effect is that the ranked document answer set is much more precise (in the sense that it better matches the user's information need) than the answer set retrieved by the Boolean model. In practice, a large variety of alternative ranking methods have been compared to the vector model, but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives [3]. Its additional advantages are that it is simple and fast. For these reasons, the vector model is the most popular retrieval model among researchers, practitioners, and the Web community.

Probabilistic Model

The probabilistic model attempts to capture the IR problem within a probabilistic framework. It is based on the following assumption: given a user query q and a document dj in the collection, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labelled R and should maximise the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query; documents not in this set are predicted to be non-relevant. In this model, the index term weight variables are all binary [3].

In theory, the main advantage of the probabilistic model is that documents are ranked in decreasing order of their probability of being relevant. However, its disadvantages are also obvious: it needs to guess an initial separation of documents into relevant and non-relevant sets; the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary); and it adopts an independence assumption for index terms.

In this project, we intend to use the vector model to build the system because this model is simple and yet powerful. The vector operations can be performed efficiently, so very large collections can be handled. Furthermore, it has been shown that its retrieval effectiveness is significantly higher than that of Boolean retrieval models. We therefore discuss this model further in the next section.

2.2 Vector and Vector Space Models

Term Weighting Systems

Term weighting methods are used to place different emphases on a term's (or a keyword's) relationship to the other terms and other documents in the collection. Currently there are several mathematical models being used to relate term precision weights to the frequency of occurrence of the terms in a given document collection and to the number of relevant documents a user wishes to retrieve in response to a query. In this section, we discuss the main ideas behind the most popular and effective term-weighting techniques.

Automatic indexing techniques are statistical. They are based on Luhn's hypothesis: the frequency of word occurrence in an article furnishes a useful measure of word significance. Normally, high-frequency terms tend to be too common in text generally to be of use, while low-frequency terms are considered unlikely to characterise the central information content of the document, so the idea is to measure the frequencies of words and apply two cut-offs, preserving only the mid-frequency words.

In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. The index terms can be weighted proportionally to their frequencies of occurrence. Let freqij be the frequency with which term ki occurs in dj, and define the normalised frequency fij to be:

    fij = freqij / max_l ( freqlj )

where the maximum is taken over all terms l = 1..t occurring in dj.

Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. In a general sense, a good index term should both describe the document well and distinguish that document from all others in the collection. This factor is usually referred to as the inverse document frequency. How well an index term distinguishes a document can be measured by the inverse frequency of occurrence of the term across all documents. Define the inverse document frequency idfi for ki as:

    idfi = log( N / ni )

where N is the number of documents in the collection and ni is the number of documents indexed by term ki. Combining the term frequency and the inverse document frequency, the best-known term-weighting schemes are given by:

    wij = fij * log( N / ni )

Finally, the weights can be normalised to ensure they lie between 0 and 1:

    wij = fij * log( N / ni ) / sqrt( sum over l=1..t of ( flj * log( N / nl ) )^2 )

Basic Vector Model

For the vector model, the weight wij associated with a pair (ki, dj) is positive and non-binary. Further, the index terms in the query are also weighted. Let wiq be the weight associated with the pair [ki, q], where wiq >= 0. Then the query vector q is defined as q = (w1q, w2q, ..., wtq), where t is the total number of index terms in the system. As above, the vector for a document dj is represented by dj = (w1j, w2j, ..., wtj).
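
A minimal Perl sketch of this weighting scheme follows. The in-memory data structures (a raw term-frequency hash for one page and a document-frequency hash for the collection) are assumptions chosen for illustration, not the actual index format used by the SCS scripts.

# Normalised tf-idf weights for one document (illustrative data structures).
use strict;
use warnings;
use List::Util qw(max sum);

# $freq:     hash ref, term => raw count of the term in this page   (freqij)
# $doc_freq: hash ref, term => number of pages containing the term  (ni)
# $n_docs:   total number of pages in the collection                (N)
sub tfidf_weights {
    my ($freq, $doc_freq, $n_docs) = @_;
    my $max_freq = max(values %$freq) || 1;
    my %w;
    for my $term (keys %$freq) {
        my $tf  = $freq->{$term} / $max_freq;          # fij
        my $idf = log($n_docs / $doc_freq->{$term});   # idfi = log(N/ni)
        $w{$term} = $tf * $idf;                        # wij before normalisation
    }
    my $norm = sqrt(sum(map { $_ ** 2 } values %w) || 1);
    $w{$_} /= $norm for keys %w;                       # keep weights in [0, 1]
    return \%w;
}

# Example call with made-up counts for one page.
my $w = tfidf_weights({ degree => 3, leeds => 1 }, { degree => 12, leeds => 40 }, 1000);
printf "%s %.3f\n", $_, $w->{$_} for sort keys %$w;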

In the vector model, then, a document dj and a user query q are both represented as t-dimensional vectors. The vector model proposes to evaluate the degree of similarity of the document dj with regard to the query q as the correlation between the vectors dj and q. This correlation can be quantified by the cosine of the angle between the two vectors:

    sim(dj, q) = dj . q / ( |dj| |q| )
               = sum over i=1..t of ( wij * wiq ) / ( sqrt( sum of wij^2 ) * sqrt( sum of wiq^2 ) )
               = cos(theta)

Fig 2.1  Vector space model: the similarity of a document D1 to a query Q is the cosine of the angle theta between their vectors (shown for the two-term case with axes T1 and T2; diagram not reproduced).

A vector space model is an alternative algebraic model which can be used to represent both terms and documents in a text collection. Contrary to the basic Boolean query model, the vector space model allows finding the documents which are most similar to the query without the need for a 100 percent match. In the vector space model, both queries and documents are represented as term vectors of the form Di = (di1, di2, ..., dit) and Q = (q1, q2, ..., qt). A document collection is then represented as a term-document matrix A (Fig 2.2, not reproduced here).

The similarity between a query vector Q and a document term vector Di can then be computed as the same cosine measure:

    sim(Q, Di) = sum over j=1..t of ( qj * dij ) / ( sqrt( sum of qj^2 ) * sqrt( sum of dij^2 ) )

This method of computing similarity coefficients between queries and documents is particularly advantageous because it allows one to sort all documents in decreasing order of similarity to a particular query. This also permits one to adapt the size of the retrieved document set to the user's needs.

Here, we take an example in a low-dimensional space.

Figure 2.3  A 9 x 7 term-by-document matrix built from seven SCS local web pages D1-D7 and nine index terms: T1 Information, T2 Leeds, T3 Research, T4 Vacancy, T5 Degree, T6 Multimedia, T7 Language, T8 Undergraduate, T9 Scholarship (matrix entries not reproduced).

Figure 2.3 demonstrates the simple idea of how a 9 x 7 term-by-document matrix is constructed from a small collection of SCS local web pages. The actual values assigned to the elements of the term-by-document matrix A = [aij] are usually weighted frequencies as opposed to the raw counts of term occurrences (within a document or across the entire collection).

The small collection of documents from Figure 2.3 can be used to illustrate simple query matching in a low-dimensional space. Since there are exactly 9 terms used to index the 7 documents, queries are represented as 9 x 1 vectors in the same way that each of the 7 pages is represented as a column of the 9 x 7 term-by-document matrix A. In order to retrieve the documents containing information about an undergraduate degree, the query vector has a 1 in the positions of the terms Degree (T5) and Undergraduate (T8) and 0 elsewhere, i.e. q = (0 0 0 0 1 0 0 1 0). Query matching in the vector space model can be viewed as a search in the column space of the matrix A for the documents most similar to the query. One of the most common similarity measures used for query matching is the cosine of the angle between the query vector and the document vectors (Fig 2.4 [18], not reproduced here).

In constructing a term-by-document matrix, terms are usually identified by their word stems. In the example shown in Figure 2.3, the words Degree and Degrees are counted as one term. Stemming reduces the number of rows in the term-by-document matrix A. The reduction of storage (via stemming) is certainly an important consideration for large collections of web documents.

Synonymy and polysemy are two further points which need the designer's attention. Synonymy refers to the use of synonyms or different words that have the same meaning, and polysemy refers to words that have different meanings when used in varying contexts.
Methods for handling the effects of synonymy and polysemy in the context of vector space models are not discussed further in this project because of time limitations. The generalised vector model introduces the further idea that document and query representations are translated directly into the term space. We adopted the vector model techniques and tried to improve the performance of the SCS search engine with them at design time.
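
To make the ranking computation concrete, here is a minimal Perl sketch of cosine similarity between a query vector and document vectors, each held as a hash of term weights. The data structures and the example weights are assumptions for illustration, not the SCS index format.

# Cosine similarity ranking over hashes of term weights (illustrative only).
use strict;
use warnings;
use List::Util qw(sum);

sub cosine {
    my ($q, $d) = @_;                      # hash refs: term => weight
    my $dot = sum(map { ($q->{$_} // 0) * $d->{$_} } keys %$d) // 0;
    my $nq  = sqrt(sum(map { $_ ** 2 } values %$q) || 1);
    my $nd  = sqrt(sum(map { $_ ** 2 } values %$d) || 1);
    return $dot / ($nq * $nd);
}

# Rank a small, made-up collection against the query "undergraduate degree".
my %query = (undergraduate => 1, degree => 1);
my %docs  = (
    'is.html'       => { degree => 0.8, undergraduate => 0.5, information => 0.3 },
    'research.html' => { research => 0.9, scholarship => 0.4 },
);
my @ranked = sort { cosine(\%query, $docs{$b}) <=> cosine(\%query, $docs{$a}) } keys %docs;
printf "%-14s %.3f\n", $_, cosine(\%query, $docs{$_}) for @ranked;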

Chapter 3  User Interface Design

The goal of research in user interface design for information retrieval is to give all people ready access to the information they desire. Information retrieval systems must not only provide efficient retrieval, but must also support the user in describing a problem that he or she does not understand well. Even where the search function is well designed, the vocabulary problem is an important factor affecting retrieval performance: users may know what they are looking for, but lack the knowledge needed to articulate the problem in the terms and abstractions used by the retrieval system.

More experienced users with a particular subject in mind may be able to specify a query directly, which results in a jump to a particular catalogue. From there, the user can refine his or her initial query by browsing onwards from that point. On the other hand, casual users without any prior knowledge of the contents of the system, or users without any particular subject in mind, may find it much more difficult to navigate freely. Good information retrieval system design therefore combines support for different information-seeking strategies, such as browsing and direct querying, in an interface that provides effective cues to the location, use, and characteristics of the retrieved information.

Much research is being conducted to attain better interface designs using such features as natural language, intelligent agents, and direct manipulation. All three of these techniques rely on the growing understanding of the way people think and how they acquire, store, and retrieve information. Up-to-date interface designs seek to solve the problems previously mentioned. The ultimate goal is to produce a design that is comprehensible, predictable, and controllable. Based on this, we will look at three areas of research in interface design: natural language processing, intelligent agents, and direct manipulation techniques.
Finally, how these design ideas can be integrated into a Web search engine will be discussed.

3.1 Natural Language

The ability simply to talk to a computer and have it fulfil a request has been a long-time goal of both experimenters in user interface design and artificial intelligence. The goal of natural language processing is to minimise the training required for users: the more naturally users can express their information needs in plain English, the smaller the burden upon them to learn the system. Natural language enhancements have improved commercially available search engines in a number of ways but still remain primitive compared with the original goals envisioned for them (Liddy, 1998). Some of these advances include automatic truncation. Some databases are able to recognise with greater reliability the plural and singular forms of a noun. Some are able to add or subtract portions of a word, especially suffixes. We can see some automation in the identification of proper nouns; however, recognition in these systems is limited to simply looking for words that begin with a capital letter. There are some search engines that recognise simple phrases, word variations, and concepts. These systems make use of "fuzzy matching", with the computer looking for similar words or phrases in close proximity to others (Thunderstone, 1998). This is most useful for compensating for errors in data entry and phonetics. For example, Yahoo suggests other possible spellings of a term if the user is unsure about the spelling, and Excite offers a list of words associated with the terms in the query to help users build a better set of keywords after a single search.

However, the use of natural language in user interface design has drawn much criticism from many sides. Philosophical objections abound about whether a computer can ever actually understand human language. Some studies have shown that often there is not much difference in performance between artificial and natural language systems (Ogden & Bernick, 1997).
Some experiments even showed that users were actually able to use the artificial language system faster and with more reliability once the system was learned. This raises the question of whether we need better natural language systems or better ways of teaching users.

3.2 Intelligent or Artificial Agents

Intelligent agents have attracted a great deal of attention over the last few years in the information retrieval area. The concept began with the notion that computers could become our personal secretaries, assistants, and reference librarians. An intelligent agent can autonomously carry out a task given by its user. It is a new way to reduce the time spent on routine personal tasks and so increase the time available for more gratifying activities (Roesler & Hawkins, 1994).

The vision of an intelligent agent proposed by interface designers in this area of work has altered over the last few years. In part, this change is due to the fact that computer technology is far removed from the initial vision of smart agents traversing the world's networks carrying out the bidding of their masters. This vision is still in the domain of the science fiction writer. Computer technology limits the current ability of even the simplest of agents. Even human-to-human communication of information needs is a difficult task because of the ambiguity of language. The information seeker often does not fully understand what he or she needs and how to express it. For computers to fully understand the ambiguous nature of a human information request is far, if ever, in the future.

3.3 Visualisation

Information visualisation is one of the most exciting areas of work in interface design for information retrieval. The basic idea is to transform data into graphic representations that will help users understand the information they have received from the machine (Pack, 1998). This concept takes the basics of the GUI and pushes the limits of design to include such things as 3D imaging, filter-flow modelling, and dynamic queries.

Much of the work being done in this area comes from many years of cognitive research and the belief that first-time users are frustrated by what they see on their displays (Shneiderman, 1997). They are using systems that provide them with few clues to the status of the system and how to use it. What they need is a system that reduces the cognitive burden of navigation. This is done through search fields, menus, direct-manipulation designs, and by following simple visual-coding rules that make these systems much easier to use.

There are several key techniques in this field. The first one is overview: users are given some sense of an overview of the entire collection. In traditional systems, users are returned information but have little knowledge of the scope of the entire collection; information visualisation systems allow the user to see, for example, a 3D hierarchical directory of the document set. Zooming is the second technique. This tool allows the user to move in and out of areas of interest within a collection. The user can zoom from an overview down to any area that looks promising. The zooming tool should work smoothly and allow the user to move back and forth quickly while preserving his or her sense of position in the collection. Filtering allows unwanted items to be thrown out, so that users can sort through large numbers of references. Dynamic queries use numeric sliders, buttons, and alpha sliders to push away unwanted items and to focus rapidly on items of interest. These queries are designed for quickness, reversibility, and immediate feedback. Once the number of items returned has been trimmed down, the information system should allow easy browsing of the details of the items. Moreover, the relate task in interface design provides visualisation of the relationships between items, whereas in traditional systems users sometimes have little information about the relationships between the items they are given. Another technique is history keeping: a good interface design should be able to undo, replay, and allow the user to refine his or her search. Information searching is a process, and the interface should help the user retrace his or her steps if the wrong path is followed.

Finally, a good interface is one that allows the extraction of sub-collections and of query sets. Methods are being sought to save information in a format that can be imported into other applications. Some systems already allow the user to drag-and-drop an item into an e-mail, a graphics page, or the next application window. However, there are still many technical difficulties that limit these design ideas; for instance, faster computers and networks are needed for the approaches discussed here to work and to allow rapid movement through data fields.

3.4 User Interface Design in WWW Search Engine

As we know, with the development of the GUI, the World Wide Web has already had a tremendous effect on user access to information. The Web browser is the most popular form of electronic information retrieval interface in use today. The interface design, with its few simple buttons (e.g. back, forward, stop), makes many things easier. Hyperlinks provide information on related topics and help users to move down through a site. However, there are drawbacks to the Web browser. For instance, users tend to forget where they are and how to return, because their paths are usually non-linear. Users also find it hard to remember what they previously looked at, and exploring every link is tedious. Because the number of web documents is always very large, the amount of retrieved information and its quality are also unpredictable.

The design of a user interface which permits gradual enlargement or refinement of the user's query by browsing through a graph of term and document subsets is a good way to improve the user's ability to control the amount of output obtained from a query. Geller and Lesk [1983], in a study comparing menus and specification of terms, suggested:

"Perhaps the most efficient way to do a subject search is to start with a keyword search to locate the correct category and then browse through the classification."

In line with the ideas from visualisation, a method for viewing the whole site structure through a viewing window is very useful. An overview can show the topic domains represented within the collections, helping users select or eliminate sources from consideration and helping them get started by directing them into general neighbourhoods, after which they can navigate using more detailed descriptions. The most popular type of overview is the display of large topical category hierarchies associated with the documents of a collection. Users can select a particular category and search for information on a smaller or larger scale. In some cases, a node of the category hierarchy can be associated directly with some relevant documents. However, it is difficult to design a good interface that integrates category selection into query specification. One reason is that displaying category hierarchies takes up a large amount of screen space.

The idea of natural language input is commonly used in practice: we let users express their information needs in free format, and provide various search types. Moreover, relevance feedback techniques are also crucial for supporting the iterative refinement of information needs in interface design. In a relevance feedback cycle, the user is presented with a list of retrieved documents and, after examining them, marks those which are relevant. Important terms, or expressions, attached to the documents identified as relevant are then selected, and the importance of these terms is enhanced in a new query formulation.

Chapter 4  System Design

4.1 Analysing the Current SCS Intranet Web Sites

The first task is to investigate the size of the SCS web site. There are about one thousand available web pages and documents on the web sites. They are hosted on a Unix server of the School of Computer Studies. The template search system will be installed on the Cslin Linux server.

4.2 The System Architecture

Fig 4.1  System architecture: the user interface communicates over the network with the searching modules and the indexing module, which read and write the index files (diagram not reproduced).

Because the size of the web site is not very big, it is not necessary to adopt a three-tier client-server architecture and use a special database server to store the index files. In this project, the index files are simply stored as text files on the same Cslin Linux server.

The web crawler can run on a computer separate from the one providing search services, through a separate interface. All indexing records are written to a self-contained file that can be transferred from a search workstation up to the search server.

4.3 Building the Index File and Ranking Hits

Fig 4.2  The keyword weighting system: web pages have fluff words removed (using the fluff-word list), the remaining words are stemmed, keywords are selected and weighted, and the index files are created or updated; a user query is then matched and ranked against the index to produce the search results (diagram not reproduced).

Keyword weighting system. As we discussed in Chapter 2, the keyword weighting system is used to give each keyword a weight according to its frequency and inverse document frequency.

Ranking system. In the vector model a document and a user query are represented as t-dimensional vectors, and their correlation can be quantified by the cosine of the angle between these two vectors.
When matching the query against the documents (web pages), the following formula is used to rank the retrieved web pages:

    sim(dj, q) = dj . q / ( |dj| |q| )

The retrieved web pages are sorted by these ranks, so that the most relevant document appears first. Search terms found in the title, keywords, or description are given additional weight.

Removing fluff words. Removing fluff words is the first step towards saving space and speeding up searches. To save space, a search engine should remove extremely common words such as "the", "web", "a" and "is". Extremely common words are not useful for finding the target web page, so they should not be put in the index file. The index file retains most of the relevant content of the original web page, and the extra space can be used to store more web pages. It is also clear that, in theory, it will take more time to find the target web page if the fluff words are kept in the index file. For example, for the string "the piano player", the search engine has to make three runs to find matches (again, this is oversimplified): first it looks for all matches of "the", then all matches of "piano", then all matches of "player". Chances are that looking for just the last two words is enough to find the relevant pages.

Stemming query and web page keywords. Sometimes a given topic is represented by different forms (e.g. plural or singular) of the same word. As well as plural and singular variation, we have the gerund form (making a verb out of a noun, as in index -> indexing) and different tenses (indexed) [13], as the sketch below illustrates.
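
The following Perl sketch illustrates these two preprocessing steps on a list of words. The fluff-word list is abbreviated, and stem_word() applies only a crude suffix rule as a stand-in for the full Porter algorithm (Appendix D); it is an illustration, not the SCS implementation.

# Fluff-word removal followed by (very simplified) stemming.
use strict;
use warnings;

my %fluff = map { $_ => 1 } qw(the a is and of to web);   # abbreviated list

sub stem_word {                       # crude stand-in for the Porter stemmer
    my ($w) = @_;
    $w = lc $w;
    $w =~ s/(?:ing|ed|es|s)$// if length($w) > 4;   # strip one common suffix
    return $w;
}

my @words = qw(Indexing the Piano Players of Leeds);
my @keep  = grep { !$fluff{lc $_} } @words;         # drop fluff words
my @stems = map  { stem_word($_) } @keep;           # reduce to stems
print "@stems\n";                                   # prints "index piano player leed"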

Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In the web IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings. Thus, stemming reduces the number of terms in the index file. The reduction in storage (via stemming) is certainly an important consideration for large collections of web documents. Moreover, it can improve the recall rate of information retrieval. Therefore, a good stemming algorithm must be used in the search engine.

4.4 User Interface Design

According to the investigation in Chapter 3, the ideal user interface for the search engine should support free-format query input. For this purpose, the search form provides various search types, and users can choose whichever is most convenient from the interface form.

At the early stages of the project, I hoped the SCS search engine could support a catalogue overview as well, so that users could view the whole structure of the local site, find their target directory quickly, and reduce the scale of the search. User relevance feedback was also taken into consideration: in a feedback cycle, a new query would be used to modify the original query, and a set of returned documents would replace the original document collection, reducing the scale of the search. However, the SCS web site is not a big web site, so it is not really necessary or useful to implement relevance feedback in the SCS search engine.

Hierarchical category overview

Fig 4.3  Sketch of the planned hierarchical category overview (entries include new_student_information.html, /MSc, is.html, dms_msc.html, /staff, /research and /campus; tree diagram not reproduced).

Through the overview tree, users can click on the area they are interested in to reduce the scale of the search. This tool allows the user to move in and out of areas of interest within a collection. The user can zoom from an overview down to any area that looks promising, and can quickly move back and forth while preserving his or her sense of position in the collection.

Unfortunately, it is difficult to design a good interface that integrates category selection into query specification, because displaying category hierarchies takes up a large amount of screen space. Another difficulty is how to classify the hundreds of web pages correctly, and how to let the web administrators add new web pages to the right category when they update the web site.
In many cases, the same web page can belong to several categories. When the design was put into implementation, these difficulties could not be overcome in a short time, and the category overview was finally dropped from the project implementation.

Three search types

Three search types are provided to users, giving the engine the ability to handle simple, complex and string queries:

All Terms Searching - every term in the search string must be matched; for example, a search for "Information retrieval" will return primarily web pages containing both "Information" and "retrieval".

Any Term Searching - the search string is matched word by word; for example, a search for "Information retrieval" will return documents containing either "Information" or "retrieval".

Phrase Searching - all terms must occur next to each other and in the original order.

4.5 Administrator Interface Design

The web site administrator can index or update web pages with a web crawler through a web interface. To access the administration page the user needs a password, in order to protect the database. The administrator can also reset the password through the web browser.

Chapter 5  Implementation

5.1 Choosing a Programming Language

Perl is a scripting language and is not compiled to an object binary like C or Pascal. It has its own syntax and libraries, and communicates with web servers using the CGI standard. It can be used on most platforms and with most web servers. Moreover, Perl's text-processing functions are very powerful, and the language is very easy to learn and use. For these reasons, Perl was chosen as the programming language for building the search engine.

5.2 Finding an Appropriate Template Search Tool

Due to the limited time available, some free template search tools were used in this project for developing the SCS search engine. Price, platform, capacity, ease of installation, and maintenance were the factors taken into consideration when selecting the template search tool. After comparing several search tools, the Fluid Dynamics Search Engine was chosen as the original template for the SCS search engine. It is programmed in Perl, and the script runs well on Unix and Windows platforms or any other Perl CGI environment. The script retrieves the pages over the web site, and a web page's text, keywords, description, title, and address are all extracted and used for searching. It then saves these contents in a separate text file (index file I). During the search process, the script searches the index file to find all links related to the user's query. The retrieved web pages are sorted by the number of keyword hits, and the title, description, size, last modified time, and address of each document are shown to the user in the list of hits.
The administrator can configure the number of hits to show per page.

5.3 Adding a Stemming Algorithm

The template search engine does not perform term stemming. In the SCS search engine, Porter's stemming algorithm [15] was used to implement this function. The code was obtained from the web; after some alteration, it was incorporated into the SCS search engine as a function.

5.4 Removing More Fluff Words

The template scripts contain a list of fluff words; however, the SCS search engine should remove more fluff words specific to this collection. The additional fluff words come from the MSc IS14 module handouts [19].

5.5 Improving the User Interface

The template engine lets the user search with two types, "any term" and "all terms". In the SCS search engine, the user interface was altered and a phrase search type was added to the search form, and the search script was edited to support phrase search. The search terms are stored in the index file in the same order as in the original web pages; when the user chooses this search type, the query terms must match the whole phrase in the right order, as illustrated in the sketch below.
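
A minimal Perl sketch of the three matching modes follows. For clarity it matches against raw page text, and the variable names are hypothetical; the real scripts work against the index file rather than the page text itself.

# All-terms, any-term and phrase matching against a page's text (illustrative).
use strict;
use warnings;

sub page_matches {
    my ($text, $query, $type) = @_;
    my @terms = split /\s+/, lc $query;
    my $lc    = lc $text;

    if ($type eq 'phrase') {                 # terms adjacent and in order
        my $phrase = join ' ', @terms;
        return $lc =~ /\b\Q$phrase\E\b/;
    }
    my $hits = grep { $lc =~ /\b\Q$_\E\b/ } @terms;
    return $type eq 'all' ? $hits == @terms  # every term present
                          : $hits > 0;       # at least one term present
}

my $page = "MSc Information Systems modules cover information retrieval.";
print page_matches($page, "information retrieval", $_) ? "$_: match\n" : "$_: no match\n"
    for qw(all any phrase);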

Fig 5.1  Search form (screenshot not reproduced)

Fig 5.3  Results list (screenshot not reproduced)

For the reasons mentioned in Chapter 4, the implementation of the category overview was abandoned.

5.6 Using the Vector Model

The binary model is used in the original template search engine. In the final SCS search engine, a vector model was added. In order to compare the effect of the two models on the performance of the search engine, the binary model was kept in the SCS search engine as well, and the user can choose either of them from the search form.

The term weighting system was programmed in Perl. Following the theory in Chapter 2, each term is given a weight according to its frequency and inverse document frequency at indexing time. A separate Perl script was written to implement this task.

Fig 5.2  The weighting program reads index file I (keyed by web page number) and writes the term weights of each web page to index file II, together with a position file recording, for each web page key number, the position of that page's entry (diagram not reproduced).

The term weights are stored in a separate text file (index file II). In order to save CPU time during real-time searching, a position file is created to keep the position of each web page's entry in index file II. Through the position file, the search engine can obtain the term weights of any document (web page) in index file II directly.

Term stemming affects the weight results in some ways. For example, through stemming, the four terms computer, compute, computing and computability are all replaced by comput.
Thus the frequencies of the four terms are summed up into the single term comput. In this way, web pages containing any of these terms have a better chance of being matched against a relevant topic about computers.

Retrieved documents are sorted by ranking score. The scores depend on the query term weights and the document term weights. In order to make the query look more natural to the user, we do not place any format restrictions on the input: users can include symbols in their query such as asterisks, quotes, and brackets, and these will not affect the search results. After fluff words are removed, the query terms are treated equally, and the weights of the document terms are used to compute the ranks of the retrieved web pages. The ranking function is also programmed in Perl and incorporated into the SCS search engine, using the position file lookup sketched below.
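
To show how the position file avoids scanning the whole weight file at query time, here is a minimal Perl sketch. The file names, the record layout (one line per page of tab-separated term=weight pairs) and the byte-offset scheme are assumptions for illustration, not the exact formats used by the SCS scripts.

# Look up one page's term weights via a byte-offset position file (illustrative).
use strict;
use warnings;

# position.txt lines:  <page_key>\t<byte offset of that page's line in index2.txt>
# index2.txt lines:    <page_key>\t<term=weight>\t<term=weight>...

sub load_positions {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    my %pos;
    while (<$fh>) {
        chomp;
        my ($key, $offset) = split /\t/;
        $pos{$key} = $offset;
    }
    return \%pos;
}

sub weights_for_page {
    my ($index_path, $pos, $key) = @_;
    open my $fh, '<', $index_path or die "cannot open $index_path: $!";
    seek $fh, $pos->{$key}, 0;              # jump straight to the page's record
    my $line = <$fh>;
    chomp $line;
    my (undef, @pairs) = split /\t/, $line; # drop the leading page key
    return { map { split /=/, $_, 2 } @pairs };
}

my $pos = load_positions('position.txt');
my $w   = weights_for_page('index2.txt', $pos, 'is.html');
printf "%s %.3f\n", $_, $w->{$_} for sort keys %$w;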

Chapter 6  Testing, Evaluation and Conclusions

6.1 Testing Environment

The SCS search engine was tested under the following environmental conditions; the configurations used were those available to the author.

Server software:
- Windows 98 with Perl 5 (running locally)
- Linux server (the Cslin server described in Chapter 4)

Browser configurations:
- Windows 98, 56k modem link to the remote server and using the local server, with Internet Explorer 4.0
- Windows NT 4.0, networked, with Internet Explorer 5.0 and Navigator

6.2 Evaluation Criteria

First of all, as discussed in the first chapter, recall and precision are two of the most important measures for evaluating an information retrieval system. Recall, the proportion of relevant documents retrieved, is not considered a viable measure for Internet search engines, because it is impossible to determine how many relevant items there are for a particular query. However, recall may be a viable measure for intranet search engines, because in the relatively confined intranet environment it may be possible to identify with reasonable surety the documents relevant to a particular query.


More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Interaction Style Categories. COSC 3461 User Interfaces. What is a Command-line Interface? Command-line Interfaces

Interaction Style Categories. COSC 3461 User Interfaces. What is a Command-line Interface? Command-line Interfaces COSC User Interfaces Module 2 Interaction Styles What is a Command-line Interface? An interface where the user types commands in direct response to a prompt Examples Operating systems MS-DOS Unix Applications

More information

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved.

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. User Guide Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. Central Search User Guide Table of Contents Welcome to Central Search... 3 Starting Your Search... 4 Basic Search & Advanced

More information

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17 Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision

More information

Contents 1. INTRODUCTION... 3

Contents 1. INTRODUCTION... 3 Contents 1. INTRODUCTION... 3 2. WHAT IS INFORMATION RETRIEVAL?... 4 2.1 FIRST: A DEFINITION... 4 2.1 HISTORY... 4 2.3 THE RISE OF COMPUTER TECHNOLOGY... 4 2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL...

More information

Part 11: Collaborative Filtering. Francesco Ricci

Part 11: Collaborative Filtering. Francesco Ricci Part : Collaborative Filtering Francesco Ricci Content An example of a Collaborative Filtering system: MovieLens The collaborative filtering method n Similarity of users n Methods for building the rating

More information

Automatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey

Automatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey I. Automatic Document; Retrieval Systems The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey decimal aystemf Library of Congress system). Cross-indexing

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

This session will provide an overview of the research resources and strategies that can be used when conducting business research.

This session will provide an overview of the research resources and strategies that can be used when conducting business research. Welcome! This session will provide an overview of the research resources and strategies that can be used when conducting business research. Many of these research tips will also be applicable to courses

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Taxonomies and controlled vocabularies best practices for metadata

Taxonomies and controlled vocabularies best practices for metadata Original Article Taxonomies and controlled vocabularies best practices for metadata Heather Hedden is the taxonomy manager at First Wind Energy LLC. Previously, she was a taxonomy consultant with Earley

More information

Cognitive Walkthrough Evaluation

Cognitive Walkthrough Evaluation Columbia University Libraries / Information Services Digital Library Collections (Beta) Cognitive Walkthrough Evaluation by Michael Benowitz Pratt Institute, School of Library and Information Science Executive

More information

Automated Cognitive Walkthrough for the Web (AutoCWW)

Automated Cognitive Walkthrough for the Web (AutoCWW) CHI 2002 Workshop: Automatically Evaluating the Usability of Web Sites Workshop Date: April 21-22, 2002 Automated Cognitive Walkthrough for the Web (AutoCWW) Position Paper by Marilyn Hughes Blackmon Marilyn

More information

Information Retrieval CSCI

Information Retrieval CSCI Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1

More information

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz The Cranfield Paradigm Retrieval Performance Evaluation Evaluation Using

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Searching the Evidence using EBSCOHost

Searching the Evidence using EBSCOHost CAMBRIDGE UNIVERSITY LIBRARY MEDICAL LIBRARY Supporting Literature Searching Searching the Evidence using EBSCOHost ATHENS CINAHL Use to search CINAHL with an NHS ATHENS login (or PsycINFO with University

More information

Automated Item Banking and Test Development Model used at the SSAC.

Automated Item Banking and Test Development Model used at the SSAC. Automated Item Banking and Test Development Model used at the SSAC. Tural Mustafayev The State Student Admission Commission of the Azerbaijan Republic Item Bank Department Item Banking For many years tests

More information

COCHRANE LIBRARY. Contents

COCHRANE LIBRARY. Contents COCHRANE LIBRARY Contents Introduction... 2 Learning outcomes... 2 About this workbook... 2 1. Getting Started... 3 a. Finding the Cochrane Library... 3 b. Understanding the databases in the Cochrane Library...

More information

Document Clustering for Mediated Information Access The WebCluster Project

Document Clustering for Mediated Information Access The WebCluster Project Document Clustering for Mediated Information Access The WebCluster Project School of Communication, Information and Library Sciences Rutgers University The original WebCluster project was conducted at

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

Using Scopus. Scopus. To access Scopus, go to the Article Databases tab on the library home page and browse by title.

Using Scopus. Scopus. To access Scopus, go to the Article Databases tab on the library home page and browse by title. Using Scopus Databases are the heart of academic research. We would all be lost without them. Google is a database, and it receives almost 6 billion searches every day. Believe it or not, however, there

More information

Search Engine Optimization (SEO) using HTML Meta-Tags

Search Engine Optimization (SEO) using HTML Meta-Tags 2018 IJSRST Volume 4 Issue 9 Print ISSN : 2395-6011 Online ISSN : 2395-602X Themed Section: Science and Technology Search Engine Optimization (SEO) using HTML Meta-Tags Dr. Birajkumar V. Patel, Dr. Raina

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

User Interfaces Assignment 3: Heuristic Re-Design of Craigslist (English) Completed by Group 5 November 10, 2015 Phase 1: Analysis of Usability Issues Homepage Error 1: Overall the page is overwhelming

More information

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

CPS122 Lecture: From Python to Java last revised January 4, Objectives:

CPS122 Lecture: From Python to Java last revised January 4, Objectives: Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2)

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2) SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay Lecture #10 Process Modelling DFD, Function Decomp (Part 2) Let us continue with the data modeling topic. So far we have seen

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,

More information

Information Retrieval. Chap 7. Text Operations

Information Retrieval. Chap 7. Text Operations Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing

More information

Reference By Any Other Name

Reference By Any Other Name Boise State University ScholarWorks Library Faculty Publications and Presentations The Albertsons Library 4-1-2010 Reference By Any Other Name Ellie Dworak Boise State University This document was originally

More information

Active Server Pages Architecture

Active Server Pages Architecture Active Server Pages Architecture Li Yi South Bank University Contents 1. Introduction... 2 1.1 Host-based databases... 2 1.2 Client/server databases... 2 1.3 Web databases... 3 2. Active Server Pages...

More information

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Elementary IR: Scalable Boolean Text Search. (Compare with R & G ) Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context

More information

Skill Area 209: Use Internet Technology. Software Application (SWA)

Skill Area 209: Use Internet Technology. Software Application (SWA) Skill Area 209: Use Internet Technology Software Application (SWA) Skill Area 209.1 Use Browser for Research (10hrs) 209.1.1 Familiarise with the Environment of Selected Browser Internet Technology The

More information

Provided by TryEngineering.org -

Provided by TryEngineering.org - Provided by TryEngineering.org - Lesson Focus Lesson focuses on exploring how the development of search engines has revolutionized Internet. Students work in teams to understand the technology behind search

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Information Retrieval and Data Mining Part 1 Information Retrieval

Information Retrieval and Data Mining Part 1 Information Retrieval Information Retrieval and Data Mining Part 1 Information Retrieval 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Information Retrieval - 1 1 Today's Question 1. Information

More information

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

CABI Training Materials. Ovid Silver Platter (SP) platform. Simple Searching of CAB Abstracts and Global Health KNOWLEDGE FOR LIFE.

CABI Training Materials. Ovid Silver Platter (SP) platform. Simple Searching of CAB Abstracts and Global Health KNOWLEDGE FOR LIFE. CABI Training Materials Ovid Silver Platter (SP) platform Simple Searching of CAB Abstracts and Global Health www.cabi.org KNOWLEDGE FOR LIFE Contents The OvidSP Database Selection Screen... 3 The Ovid

More information

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about:

WWW and Web Browser. 6.1 Objectives In this chapter we will learn about: WWW and Web Browser 6.0 Introduction WWW stands for World Wide Web. WWW is a collection of interlinked hypertext pages on the Internet. Hypertext is text that references some other information that can

More information

Due on: May 12, Team Members: Arpan Bhattacharya. Collin Breslin. Thkeya Smith. INFO (Spring 2013): Human-Computer Interaction

Due on: May 12, Team Members: Arpan Bhattacharya. Collin Breslin. Thkeya Smith. INFO (Spring 2013): Human-Computer Interaction Week 6 Assignment: Heuristic Evaluation of Due on: May 12 2013 Team Members: Arpan Bhattacharya Collin Breslin Thkeya Smith INFO 608-902 (Spring 2013): Human-Computer Interaction Group 1 HE Process Overview

More information

Vision Document for Multi-Agent Research Tool (MART)

Vision Document for Multi-Agent Research Tool (MART) Vision Document for Multi-Agent Research Tool (MART) Version 2.0 Submitted in partial fulfillment of the requirements for the degree MSE Madhukar Kumar CIS 895 MSE Project Kansas State University 1 1.

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

USER SEARCH INTERFACES. Design and Application

USER SEARCH INTERFACES. Design and Application USER SEARCH INTERFACES Design and Application KEEP IT SIMPLE Search is a means towards some other end, rather than a goal in itself. Search is a mentally intensive task. Task Example: You have a friend

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

evision Review Project - Engagement Simon McLean, Head of Web & IT Support Information & Data Services.

evision Review Project - Engagement Simon McLean, Head of Web & IT Support Information & Data Services. evision Review Project - Engagement Monitoring Simon McLean, Head of Web & IT Support Information & Data Services. What is Usability? Why Bother? Types of usability testing Usability Testing in evision

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR models: Boolean model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Browsing boolean vector probabilistic

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Creating a Course Web Site

Creating a Course Web Site Creating a Course Web Site What you will do: Use Web templates Use shared borders for navigation Apply themes As an educator or administrator, you are always looking for new and exciting ways to communicate

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Web-interface for Monte-Carlo event generators

Web-interface for Monte-Carlo event generators Web-interface for Monte-Carlo event generators Jonathan Blender Applied and Engineering Physics, Cornell University, Under Professor K. Matchev and Doctoral Candidate R.C. Group Sponsored by the University

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model Peter Brusilovsky http://www2.sis.pitt.edu/~peterb/2140-051/ Final Group Projects Groups of variable

More information

Predictive Coding. A Low Nerd Factor Overview. kpmg.ch/forensic

Predictive Coding. A Low Nerd Factor Overview. kpmg.ch/forensic Predictive Coding A Low Nerd Factor Overview kpmg.ch/forensic Background and Utility Predictive coding is a word we hear more and more often in the field of E-Discovery. The technology is said to increase

More information

DRACULA. CSM Turner Connor Taylor, Trevor Worth June 18th, 2015

DRACULA. CSM Turner Connor Taylor, Trevor Worth June 18th, 2015 DRACULA CSM Turner Connor Taylor, Trevor Worth June 18th, 2015 Acknowledgments Support for this work was provided by the National Science Foundation Award No. CMMI-1304383 and CMMI-1234859. Any opinions,

More information

CSI Lab 02. Tuesday, January 21st

CSI Lab 02. Tuesday, January 21st CSI Lab 02 Tuesday, January 21st Objectives: Explore some basic functionality of python Introduction Last week we talked about the fact that a computer is, among other things, a tool to perform high speed

More information