A SEARCH ENGINE FOR THE SCHOOL OF COMPUTING WEB SITE
Jian Ye Deng
MSc Information Systems
School of Computer Studies
University of Leeds
The candidate confirms that the work submitted is his own and appropriate credit has been given where reference has been made to the work of others.
Abstract

This project set out to review local web site search engine and information retrieval (IR) techniques. A web search engine for the SCS intranet web site was built to aid the research.

Objectives: The primary objectives of the project were as follows:
1. Study and explore various information retrieval techniques for improving effectiveness (optimising precision and recall).
2. Build a web search engine for searching the Computer Studies web pages, applying IR techniques in the search engine.
3. Test and evaluate the search system.

To assess the impact of the IR techniques as they were incorporated into the implementation, I reviewed and explored the major information retrieval technologies and built a web search engine that adopts various approaches from these techniques. The impact of these techniques was then evaluated in a series of experiments.

The personal objectives of the project were:
1. Extend my theoretical and practical knowledge of the information retrieval field.
2. Gain experience of CGI programming.
3. Produce a fully working product.

The deliverables of this project were:
1. A working SCS web search engine.
2. This project report.

All of the objectives and deliverables of the project have been completed. This project gave me a better understanding of the running and time scheduling of projects, as well as of the specific problems which can occur in them.
Acknowledgements

I would like to thank my project supervisor, Dr Stuart Roberts, for his very useful advice and support throughout the project. His wife, Mrs Ann Roberts, and their lovely sons also made helpful contributions to the evaluation process. Credit should also be given to Dr Nick Efford and Dr Peter Jimack, who gave me useful advice as well. Thanks also go to everyone who has helped during this project. Last but not least, special thanks go to my wife for being supportive throughout this project.
Contents

Abstract 1
Acknowledgements 2
Contents 3

Chapter 1 Introduction
  Background
  Local Site Search Components
  Structure of Local Site Search Engine
  System Evaluation 9

Chapter 2 Modelling Techniques
  Introduction
    Boolean Model
    Classical Vector Model
    Probabilistic Model
  Vector and Vector Space Models
    Term Weighting Systems
    Basic Vector Model
    Vector Space Models

Chapter 3 User Interface Design Techniques
  Natural Language
  Intelligent or Artificial Agents
  Visualisation
  User Interface Design in WWW Search Engine 24

Chapter 4 System Design
  Analysing the Current SCS Intranet Web Sites
  The System Architecture
  Building the Index File and Ranking Hits
  User Interface Design
  Administrator Interface Design 31

Chapter 5 System Implementation
  Choosing a Programming Language
  Finding an Appropriate Template Search Tool
  Adding a Stemming Algorithm
  Removing More Fluff Words
  Improving the User Interface
  Using the Vector Model 35

Chapter 6 System Testing, Evaluation and Conclusion
  6.1 Testing Environment
  6.2 Evaluation Criteria
  6.3 Evaluation Process
  6.4 Evaluation Results
  6.5 Conclusions and Future Thoughts

Appendix A 43
Appendix B 44
Appendix C 45
Appendix D Porter Stemming Algorithm 49
Appendix E 52
References 55
Chapter 1 Introduction

1.1 Background

The World Wide Web is a large distributed digital information space. As we know, it started as an organisation-wide collaborative environment for sharing research documents in nuclear physics at CERN. Nowadays, the web is becoming a universal repository of human knowledge and culture, encompassing diverse information resources including personal Web pages; online digital libraries; virtual museums; product and service catalogues; government information for public dissemination; research publications; and Gopher, FTP, Usenet news, and Mail servers. However, finding useful information on the web is frequently a tedious and difficult task. To satisfy an information need, users might navigate the space of web links searching for the expected information. There is a paradox here: the more information a site has, the more useful it is, and the harder it is to navigate! Since the whole web space is vast and largely unknown, the navigation task is usually inefficient. For naive users the problem becomes much more difficult, and might entirely frustrate all their efforts. As the WWW started to explode in terms of users, servers, and pages, it became obvious that search capabilities had to be added, and a flourishing market of public search engines emerged; the first Web-based search engines came into existence soon afterwards. Regular users of the Internet's World Wide Web are very familiar with the sites that have been developed to allow users to search across all or part of the Web. Examples of these sites include Infoseek, Altavista, Hotbot, etc. They are excellent tools for finding information stored across the Web and can assist users in finding information that would otherwise be difficult to locate. Later, the growing number of intranets, i.e. intra-organisational webs (such as the SCS intranet web site) hidden from the Internet behind firewalls or
proxies, has created a need not only for public web search services but also for internal web search products.

IR is currently maturing, bringing together components from many constituent research communities, each with their own traditions and characteristics. These constituents include mathematical modelling of information using logical and probabilistic approaches, and modelling of the information-seeking process of searchers. These approaches are being added to a strong experimental and hypothesis-testing tradition within IR research, which is itself being augmented by more psychological-style experiments introduced to the computing science community via human-computer interaction and cognitive science research. Nowadays, research in IR includes modelling, document classification and categorisation, system architecture, user interfaces, data visualisation, filtering, languages, etc. Building a local web site search engine involves applying one or more of the above techniques to improve searching performance. The objective of this project is to explore various information retrieval techniques for improving effectiveness and to build a local site web search engine (the SCS web search engine) for searching the Computer Studies local web pages.

1.2 Local Site Search Components

A local site information retrieval system can be divided into several relatively separate components, which can be built and maintained individually. Taking the SCS search engine as an example, it includes the following parts. The search engine itself is the key component of the system: the program (CGI, server module, or separate server) that accepts the request from the form or URL, searches the index, and returns the results page to the server. In our search engine, the CGI programs are written in Perl, an easy and powerful programming language whose scripts can be used on most platforms and can communicate with most web servers using the CGI standard.
A site visitor filling in data and clicking a Search button on an HTML form invokes the site-search CGIs. They take the data from the form as parameters, search for the terms, limit the results according to any other settings, and return the result list as an HTML page. However, there is some overhead in sending the data back and forth, and in some cases the CGI programs can become overwhelmed.

The Search Indexer program creates the search index file. This file stores the data from the web site in a special index or database designed for very quick access. Depending on the indexing algorithm and the size of the site, this file can become very large. In the SCS search engine, several text files play this role. These files can be updated through a friendly interface in order to keep them synchronised with the pages and avoid returning obsolete results. On the other hand, creating or updating the index file is a time-consuming task.

Providing an HTML interface (search forms) for visitors to enter their search terms and specify their preferences is a necessary part of any web search engine.

Finally, how to list the web pages containing text that matches the search term(s) is the fourth part which needs to be taken into consideration when designing the search engine. Retrieved hits are sorted in some kind of relevance order, usually based on the number of times the search terms appear according to the particular ranking algorithm, and on whether they occur in a title or header. Most results listings include the title of the page and a summary (the Meta Description data, the first few lines of the page, or the most important text). In the SCS search engine, they also include the date modified, file size, and URL.

1.3 Structure of Local Site Search Engine

Like most web search engines, the SCS search engine is the application that searches the data and returns the results to the client, creating an HTML page in the specified format. It searches within an index created by an indexer application.
The users enter their search terms in a text field, and may select appropriate settings in the form. When they submit their queries, the server passes that data to the search engine application.
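The indexer/search split described above can be illustrated with a minimal sketch. The report's system is written in Perl and stores its index in text files; the Python sketch below (function and variable names are my own, not from the report) shows the same idea: an indexer builds a term-to-page index offline, and the search program looks terms up in that index rather than scanning the pages at query time.

```python
import re

def build_index(pages):
    """Indexer: map each term to the set of page IDs containing it."""
    index = {}
    for page_id, text in pages.items():
        for term in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(term, set()).add(page_id)
    return index

def search(index, query):
    """Search engine: look terms up in the index, never in the raw pages."""
    results = None
    for term in re.findall(r"[a-z]+", query.lower()):
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return sorted(results or [])

# Hypothetical two-page site for illustration
pages = {
    "about.html": "The School of Computer Studies at Leeds",
    "msc.html": "MSc degrees in Information Systems",
}
index = build_index(pages)
print(search(index, "information systems"))  # ['msc.html']
```

Because the index maps terms directly to pages, each query costs a few dictionary lookups instead of a scan of every page, which is why the report keeps the index files synchronised with the site rather than searching pages live.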
Figure 1.1 Architecture of Search Engine [13]

Once the database has been searched, the results are returned to the user in a usable format. The format should include enough of a description of the records or documents returned to allow the user to decide which document he or she wishes to display.

1.4 System Evaluation

The most common measures of search system performance are time and space. In a system designed for data retrieval, the response time and the space required are usually the metrics of most interest when evaluating the system. In a system designed for information retrieval, the retrieved documents are not exact answers and have to be ranked according to their relevance to the query. Thus, besides time and space, recall and precision are also very important retrieval evaluation measures. Recall is measured as the ratio of the number of relevant documents retrieved to the total number of relevant items that exist in the collection, and precision is measured as the ratio of the number of relevant documents retrieved to the total number of documents retrieved. A desirable IR system is one that achieves high precision for most levels of recall (if not all).
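The recall and precision measures defined above can be computed directly from a result list and a set of relevance judgements. An illustrative sketch (not from the report; document IDs are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / all retrieved;
    recall = relevant retrieved / all relevant in the collection."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, of which 2 are relevant; 5 relevant documents exist.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5", "d6", "d7"])
print(p, r)  # 0.5 0.4
```

Note the trade-off this exposes: returning every document in the collection drives recall to 1.0 but collapses precision, which is why a desirable system keeps precision high across most recall levels.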
In fact, search engines do not search the "entire" web pages in "real time"; they search databases which have been created from resources on the Internet. Therefore, the nature and content of those databases are relevant factors in evaluating the search engine as well. The best way to evaluate a search engine is to employ real end-users and real-life queries. Besides collecting hard data from the original files, we sent the evaluation form (Appendix E) to real users and analysed the feedback we received. In order to assess the effects of the various techniques on retrieval performance separately, the SCS search engine was designed to run under various conditions. We evaluated the search engine under each of these conditions and compared the results.
Chapter 2 Modelling Techniques

2.1 Introduction

In an information retrieval system, it is customary to represent each stored record and each information request by a set of content identifiers, or terms. The terms attached to the items may be assigned automatically or chosen manually; in either case, the terms used for a given item collectively represent its information content. An index term is a keyword (or group of related words) which has some meaning of its own (i.e., which usually has the semantics of a noun). Retrieval based on index terms is simple but raises key questions regarding the information retrieval task. In fact, much of the semantics of a document or user request is lost when we replace its text with a set of words. Furthermore, matching between each document and the user request is attempted in this very imprecise space of index terms. Thus, the documents retrieved in response to a user request expressed as a set of keywords are frequently irrelevant. Predicting which documents are relevant and which are not, and even to what degree, is a central problem of any modern IR system. In a web search system, the target documents will be hundreds of web pages. A ranking algorithm operates according to basic premises regarding the notion of document relevance, and distinct sets of premises (regarding web page relevance) yield distinct information retrieval models. Many models can be adopted to rank the documents, but the vector model is usually preferred due to its simplicity. Before beginning to design the search system, we reviewed the currently popular modelling techniques and chose an appropriate one for building our web search engine. There are three classic model types in information retrieval: the Boolean, the vector, and the probabilistic model. The classic models in information retrieval consider that
each document is described by a set of representative keywords called index terms. An index term is simply a document word whose semantics helps in recalling the document's main themes. In this project, these modelling techniques have been investigated for the design of the SCS search engine.

Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. It provides a framework which is easy to grasp for a common user of an IR system. However, weights are not assigned to designate term importance. Instead, a term is either used to identify a given item or it is not: when assigned, the term may be assumed to carry a weight of 1; otherwise it carries a weight of 0. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic and data retrieval systems. The original template search engine adopts this type of model, which is very easy to design. On the other hand, the Boolean model suffers from major drawbacks. First of all, its retrieval strategy is based on a binary decision criterion without any notion of a grading scale, which prevents good retrieval performance. For instance, although it may simplify input processing, the retrieval operations become complicated by the fact that in a binary indexing system the documents retrieved in response to a given query are indistinguishable from each other. All retrieved items are treated as equally close to the query, because the number of terms assigned jointly to the query and the retrieved items is the same for all items. This leads to the retrieval of potentially large classes of items that are difficult for the system user to deal with.
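The binary, set-theoretic behaviour described above can be sketched in a few lines: every document either contains a term or it does not, and a conjunctive query returns an unranked set (an illustrative Python sketch; the document contents are invented):

```python
# Each document is just the set of index terms assigned to it (weight 1 or 0).
docs = {
    "d1": {"search", "engine", "web"},
    "d2": {"web", "pages", "leeds"},
    "d3": {"search", "index", "web"},
}

def boolean_and(query_terms):
    """AND query: documents containing every term; no grading is possible."""
    result = {d for d, terms in docs.items() if set(query_terms) <= terms}
    return sorted(result)  # order is arbitrary: all hits are 'equally relevant'

print(boolean_and(["search", "web"]))  # ['d1', 'd3']
```

The sorted() call at the end makes the drawback concrete: the model gives us no principled order for d1 and d3, since both match the query exactly as well as the other.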
Classical Vector Model

To address the above limitation of binary weights, the easiest way of introducing a distinction among classes of retrieved items is to use weighted instead of binary index terms to identify queries and documents. Thus the vector model arose, accomplished by assigning non-binary weights to the index terms in queries and in documents (web pages, in the SCS search engine). In such a situation it becomes possible to compute the degree of
similarity between each document stored in the system and the user query. The records can then be retrieved in ranked order. By sorting the retrieved web pages in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is much more precise (in the sense that it better matches the user's information need) than the answer set retrieved by the Boolean model. In practice, a large variety of alternative ranking methods have been compared to the vector model, but the consensus seems to be that, in general, the vector model is either superior to or almost as good as the known alternatives [3]. Its additional advantages are its simplicity and speed. For these reasons, the vector model is the most popular retrieval model among researchers, practitioners, and the Web community.

Probabilistic Model

The probabilistic model attempts to capture the IR problem within a probabilistic framework. It is based on the following assumption: given a user query q and a document dj in the collection, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labelled R and should maximise the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query; documents not in this set are predicted to be non-relevant. In this model, the index term weight variables are all binary. [3]
In theory, the main advantage of the probabilistic model is that documents are ranked in decreasing order of their probability of being relevant. However, its disadvantages are also obvious: it needs to guess the initial separation of documents into relevant and non-relevant sets; the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary); and it adopts the independence assumption for index terms. In this project, we intend to use the vector model to build the system, because this model is simple and yet powerful: the vector operations can be performed efficiently over very large collections, and it has been shown that its retrieval effectiveness is significantly higher than that of Boolean retrieval models. We therefore discuss this model further in the next section.

2.2 Vector and Vector Space Models

Term Weighting Systems

Term weighting methods are used to place different emphases on a term's (or keyword's) relationship to the other terms and other documents in the collection. Currently there are several mathematical models being used to relate term precision weights to the frequency of occurrence of the terms in a given document collection, and to the number of relevant documents a user wishes to retrieve in response to a query. In this section, we discuss the main ideas behind the most popular and effective term-weighting techniques. Automatic indexing techniques are statistical, based on Luhn's hypothesis: the frequency of word occurrence in an article furnishes a useful measure of word significance. Normally, high-frequency terms tend to be too common in text generally to be of use, and low-frequency terms are considered unlikely to characterise the central information content of the document; so the idea is to measure the frequencies of words and apply two cutoffs, preserving only the mid-frequency words.
In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj: the index terms can be weighted proportionally to their frequencies of occurrence. Let freq_ij be the frequency with which term ki occurs in dj, and define the normalised frequency f_ij as:

    f_ij = freq_ij / max_l(freq_lj),   l = 1..t

Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. In a general sense, a good index term both describes the document well and distinguishes that document from all others in the collection. This factor is usually referred to as the inverse document frequency: how well an index term distinguishes a document can be measured by the inverse frequency of its occurrence across all documents. Define the inverse document frequency idf_i for ki as:

    idf_i = log(N / n_i)

where N is the number of documents in the collection and n_i is the number of documents indexed by term ki. Combining the term frequency and the inverse document frequency, the best-known term-weighting schemes are given by:

    w_ij = f_ij * log(N / n_i)

Finally, the weights can be normalised to ensure they lie between 0 and 1:

    w_ij = ( f_ij * log(N / n_i) ) / sqrt( sum_{l=1..t} ( f_lj * log(N / n_l) )^2 )

Basic Vector Model

For the vector model, the weight w_ij associated with a pair (ki, dj) is positive and non-binary. Further, the index terms in the query are also weighted. Let w_iq be the weight associated with the pair [ki, q], where w_iq >= 0. Then the query vector q is defined as q = (w_1q, w_2q, ..., w_tq), where t is the total number of index terms in the system. As above, the vector for a document dj is represented by dj = (w_1j, w_2j, ..., w_tj).
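The term-weighting formulas of this section translate directly into code. The sketch below (illustrative Python; the report's engine is written in Perl, and the three one-line documents are invented) computes the normalised frequency f_ij, the inverse document frequency idf_i, and the unnormalised weight w_ij = f_ij * log(N/n_i):

```python
import math
from collections import Counter

docs = [
    "information systems research at leeds",
    "undergraduate degree in information systems",
    "research scholarship vacancy",
]
N = len(docs)
tokenised = [d.split() for d in docs]
df = Counter(t for doc in tokenised for t in set(doc))  # n_i: docs containing term i

def weight(term, doc_tokens):
    counts = Counter(doc_tokens)
    f_ij = counts[term] / max(counts.values())           # normalised term frequency
    idf_i = math.log(N / df[term]) if df[term] else 0.0  # inverse document frequency
    return f_ij * idf_i

print(round(weight("degree", tokenised[1]), 3))       # 1.099: term unique to this doc
print(round(weight("information", tokenised[1]), 3))  # 0.405: term occurs in 2 of 3 docs
```

As Luhn's hypothesis suggests, the rarer term ("degree", in one document) gets a higher weight than the more common one ("information", in two), even though both occur once in the document being scored.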
Therefore, a document dj and a user query q are represented as t-dimensional vectors. The vector model evaluates the degree of similarity of the document dj with regard to the query q as the correlation between the vectors dj and q. This correlation can be quantified by the cosine of the angle between the two vectors:

    sim(dj, q) = (dj . q) / (|dj| |q|)
               = sum_{i=1..t}(w_ij * w_iq) / ( sqrt(sum_{i=1..t} w_ij^2) * sqrt(sum_{i=1..t} w_iq^2) )
               = cos(theta)

Fig 2.1 Vector Space Models [figure omitted: document vector D1 and query vector Q plotted against term axes T1 and T2, separated by an angle]

A vector space model is an alternative algebraic model which can be used to represent both terms and documents in a text collection. Contrary to the basic Boolean query model, the vector space model allows finding the documents which are most similar to the query without the need for a 100 percent match. In the vector space model, both queries and documents are represented as term vectors of the form Di = (d_i1, d_i2, ..., d_it) and Q = (q_1, q_2, ..., q_t). A document collection is then represented as a term-document matrix A.

Fig 2.2 Term-document matrix [figure omitted]
The similarity between a query vector Q and a document term vector Di can then be computed as the cosine of the angle between them:

    sim(Di, Q) = (Di . Q) / (|Di| |Q|)

This method of computing similarity coefficients between queries and documents is particularly advantageous because it allows one to sort all documents in decreasing order of similarity to a particular query. It also permits one to adapt the size of the retrieved document set to the user's needs. Here, we take an example in a low-dimensional space.

    Documents: D1  D2  D3  D4  D5  D6  D7
    Terms: T1: Information, T2: Leeds, T3: Research, T4: Vacancy, T5: Degree,
           T6: Multimedia, T7: Language, T8: Undergraduate, T9: Scholarship

Figure 2.3 [matrix entries omitted] It demonstrates how a 9 x 7 term-by-document matrix is constructed from a small collection of SCS local web pages. The actual values assigned to the elements of the term-by-document matrix A = [a_ij] are usually weighted frequencies as
opposed to the raw counts of term occurrences (within a document or across the entire collection). The small collection of documents from Figure 2.3 can be used to illustrate simple query matching in a low-dimensional space. Since exactly 9 terms are used to index the 7 documents, queries are represented as 9 x 1 vectors in the same way that each of the 7 titles is represented as a column of the 9 x 7 term-by-document matrix A. In order to retrieve the documents containing information about undergraduate degrees, the query vector is q = (0, 0, 0, 0, 1, 0, 0, 1, 0), with non-zero weights for T5 (Degree) and T8 (Undergraduate). Query matching in the vector space model can be viewed as a search in the column space of the matrix A for the documents most similar to the query. One of the most common similarity measures used for query matching is the cosine of the angle between the query vector and the document vectors.

Fig 2.4 [18]

In constructing a term-by-document matrix, terms are usually identified by their word stems. In the example shown in Figure 2.3, the words Degree and Degrees are counted as one term. Stemming reduces the number of rows in the term-by-document matrix A. The reduction of storage (via stemming) is certainly an important consideration for large collections of web documents. Synonymy and polysemy are two further points which need the designer's attention. Synonymy refers to the use of synonyms, or different words that have the same meaning,
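The query-matching procedure just described can be sketched end to end: build a small term-by-document matrix of the kind shown in Figure 2.3 (the counts below are invented for illustration, not taken from the report's collection), form the query vector for "undergraduate degree", and rank the columns by cosine similarity:

```python
import math

terms = ["Information", "Leeds", "Research", "Vacancy", "Degree",
         "Multimedia", "Language", "Undergraduate", "Scholarship"]

# 9 x 3 term-by-document matrix (three documents shown; counts are invented)
A = {
    "D1": [1, 1, 0, 0, 1, 0, 0, 1, 0],   # mentions Degree and Undergraduate
    "D2": [1, 0, 1, 1, 0, 0, 0, 0, 1],
    "D3": [0, 1, 0, 0, 1, 1, 0, 0, 0],   # mentions Degree only
}
q = [0, 0, 0, 0, 1, 0, 0, 1, 0]          # query: non-zero at T5 and T8

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

ranking = sorted(A, key=lambda d: cosine(A[d], q), reverse=True)
print(ranking)  # ['D1', 'D3', 'D2']
```

D1 ranks first because it matches both query terms, D3 ranks second on a partial match, and D2, sharing no terms with the query, scores zero; this graded ordering is exactly what the Boolean model cannot provide.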
and polysemy refers to words that have different meanings when used in varying contexts. Methods for handling the effects of synonymy and polysemy in the context of vector space models are not discussed further in this project because of time limitations. The generalized vector model introduces a new idea, in which document and query representations are translated directly into the space. We adopted the vector model techniques and tried to improve the performance of the SCS search engine at design time.
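Stemming, used above to fold Degree and Degrees into one index term, is implemented in the SCS engine with the Porter algorithm (Appendix D). A deliberately minimal suffix-stripping sketch, far cruder than Porter's rule set, illustrates the idea:

```python
def crude_stem(word):
    """Strip a few common English suffixes; a toy stand-in for Porter stemming."""
    word = word.lower()
    for suffix in ("ing", "ies", "s"):
        # Keep at least a 3-letter stem so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["degrees", "degree", "searching", "engines"]:
    print(w, "->", crude_stem(w))
# 'degrees' and 'degree' both map to 'degree', so they share one matrix row
```

The Porter algorithm achieves the same collapsing effect with a much larger, carefully ordered set of rules and measure conditions, which is why the report uses it rather than an ad hoc suffix list like this one.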
Chapter 3 User Interface Design

The goal of research in user interface design for information retrieval is to provide all people with ready access to the information they desire. Information retrieval systems must not only provide efficient retrieval, but must also support the user in describing a problem that s/he does not understand well. Even in cases where the search function is well designed, the vocabulary problem is an important factor affecting retrieval performance: users may know what they are looking for, but lack the knowledge needed to articulate the problem in the terms and abstractions used by the retrieval system. More experienced users with a particular subject in mind may have the ability to specify a query directly, resulting in a jump to a particular catalogue; from there, the user can refine the initial query by browsing onwards. On the other hand, casual users without any prior knowledge of the contents of the system, or without any particular subject in mind, may find it much more difficult to navigate freely. Good information retrieval system design therefore combines support for information-seeking strategies, such as browsing and direct querying, in an interface that provides effective cues to the location, use, and characteristics of the retrieved information. Much research is being conducted to attain better interface designs using such features as natural language, intelligent agents, and direct manipulation. All three of these techniques rely on a growing understanding of the way people think and how they acquire, store, and retrieve information. Up-to-date interface designs seek to solve the problems mentioned above; the ultimate goal is to produce a design that is comprehensible, predictable, and controllable. Based on this, we will look at three areas of research in interface design: natural
language processing, intelligent agents, and direct manipulation techniques. Finally, how to integrate these design ideas in a Web search engine will be discussed.

3.1 Natural Language

The ability simply to talk to a computer and have it fulfil a request has been a long-time goal of experimenters in both user interface design and artificial intelligence. The goal of natural language processing is to minimise the training required of users: the more naturally users can express their information needs in plain English, the smaller the burden upon them of learning the system. Natural language enhancements have improved commercially available search engines in a number of ways, but still remain very primitive compared with the original goals envisioned for them (Liddy, 1998). Some of these advances include automatic truncation. Some databases are able to recognise with greater reliability the plural and singular forms of a noun; some are able to add or subtract portions of a word, especially suffixes. We can also see some automation in the identification of proper nouns, although recognition in these systems is limited to simply looking for words that begin with a capital letter. There are some search engines that recognise simple phrases, word variations, and concepts. These systems make use of "fuzzy matching", in which the computer looks for similar words or phrases in close proximity to others (Thunderstone, 1998). This is most useful for compensating for errors in data entry and phonetics. For example, Yahoo suggests other possible ways to spell a term if the user is unsure about the spelling, and Excite offers a list of words associated with the terms in the query, helping users build a better set of keywords after a single search. However, the use of natural language in user interface design has drawn much criticism from many sides.
Philosophical objections abound about whether a computer can ever actually understand human language. Some studies have shown that often there is not much difference in performance between artificial and natural language systems (Ogden & Bernick, 1997). Some experiments even showed that users were actually able to use the artificial language system faster and with more reliability once the system was learned. This begs the question of whether we need better natural language systems or better ways of teaching users.

3.2 Intelligent or Artificial Agents

Intelligent agents have attracted a great deal of attention in the information retrieval area over the last few years. The concept began with the notion that computers could become our personal secretaries, assistants, and reference librarians. An intelligent agent can autonomously carry out a task given by its user; it is a new way of reducing the time spent on routine personal tasks and increasing the time available for more gratifying activities (Roesler & Hawkins, 1994). The vision of an intelligent agent proposed by interface designers in this area of work has altered over the last few years. In part, this change is due to the fact that computer technology is far removed from the initial vision of smart agents traversing the world's networks carrying out the bidding of their masters: that vision is still in the domain of the science fiction writer, and computer technology limits the current ability of even the simplest of agents. Even human-to-human communication of information needs is a difficult task because of the ambiguity of language; the information seeker often does not fully understand what he or she needs and how to express it. Computers fully understanding the ambiguous nature of a human information request is far in the future, if it ever arrives.

3.3 Visualisation

Information visualisation is one of the most exciting areas of work in interface design for information retrieval. The basic idea is to transform data into graphic representations that will help users understand the information they have received from the machine (Pack, 1998).
This concept takes the basics of the GUI and pushes the limits of design to include such things as 3D imaging, filter-flow modelling, and dynamic queries.
Much of the work being done in this area comes from many years of cognitive research, and from the belief that first-time users are frustrated by what they see on their displays (Shneiderman, 1997). They are using systems that provide them with few clues to the status of the system and how to use it. What they need is a system that lowers the cognitive burden of navigation. This is done through search fields, menus, direct-manipulation designs, and by following simple visual-coding rules that make these systems much easier to use. There are several key techniques in this field. The first is the overview: users are given some sense of an overview of the entire collection, whereas in traditional systems users are returned information but have little knowledge of the scope of the whole collection. Information visualisation systems allow the user to see a 3D hierarchical directory of the document set. Zooming is the second technique. This tool allows the user to move in and out of areas of interest within a collection: the user can zoom from an overview down to any area that looks promising. The zooming tool should work smoothly and allow the user to move back and forth quickly while preserving his or her sense of position in the collection. Filtering allows unwanted items to be thrown out, so that users can sort through large numbers of references. Dynamic queries use numeric sliders, buttons, and alpha sliders to push aside unwanted items and to focus rapidly on items of interest; these queries are designed for quickness, reversibility, and immediate feedback. Once the number of items returned has been trimmed down, the information system should allow easy browsing of the details of the items. Moreover, the relate task in interface design provides visualisation of the relationships between items, while in traditional systems users sometimes have little information about the relationships between the items they are given. Another technique is history keeping.
A good interface design should support undo and replay, and allow the user to refine his or her search. Information searching is a process, and the interface should help the user retrace his or her steps if the wrong path is followed. Finally, a good interface is one that allows the extraction of sub-collections and of the query sets. Methods are being sought to save information in a format that can be imported into other applications; some systems already allow the user to drag-and-drop an item into an email, a graphics page, or the next application window. However, there are still many technical difficulties that limit these design ideas. For instance, faster computers and networks are needed before the approaches discussed here can support rapid movement through large data fields.

3.4 User Interface Design in WWW Search Engines

With the development of the GUI, the World Wide Web has already had a tremendous effect on user access to information. The Web browser is the most popular form of electronic information retrieval interface in use today. Its design, with a few simple buttons (e.g. back, forward, stop), makes many things easier, and hyperlinks provide information on related topics and help users move down through a site. However, the Web browser has drawbacks. Users tend to forget where they are and how to return, because their paths are usually non-linear. Users also find it hard to remember what they previously looked at, and exploring every link is tedious. Because the number of web documents is always very large, the amount of retrieved information and its quality are also unpredictable.

The design of a user interface that permits gradual enlargement or refinement of the user's query, by browsing through a graph of term and document subsets, is a good way to improve the user's ability to control the amount of output obtained from a query. Geller and Lesk [1983], in a study comparing menu selection and specification of terms, suggested: "Perhaps the most efficient way to do a subject search is to start with a keyword search to locate the correct category and then browse through the classification."

As in the visualization theory above, a method for viewing the whole site structure through a viewing window is very useful. An overview can show the topic domains represented within the collection, help users select or eliminate sources from consideration, and help them get started by directing them into general neighbourhoods, after which they can navigate using more detailed descriptions. The most popular type of overview is the display of large topical category hierarchies associated with the documents of a collection. Users can select a particular category and search for information at a smaller or larger scale; in some cases, a node of the category hierarchy can be associated directly with relevant documents. However, it is difficult to design a good interface that integrates category selection into query specification, partly because displaying category hierarchies takes up large amounts of screen space.

The idea of natural language is commonly used in practice: we let users express their information needs in free format, and provide various search types. Moreover, relevance feedback techniques are also crucial for supporting the iterative refinement of information needs. In a relevance feedback cycle, the user is presented with a list of retrieved documents and, after examining them, marks those that are relevant. Important terms or expressions attached to the documents identified as relevant by the user are then selected, and the importance of these terms is enhanced in a new query formulation.
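The feedback cycle just described is commonly formalised as Rocchio query reformulation: the query vector is moved towards the centroid of the documents marked relevant and away from the non-relevant ones. A minimal Python sketch (the weight values are conventional defaults, not taken from any system discussed here):

```python
# Rocchio relevance feedback: reformulate a query vector using the
# documents the user marked relevant / non-relevant.
# alpha, beta, gamma are conventional default weights (an assumption).

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query and each document are dicts mapping term -> weight."""
    terms = set(query)
    for d in relevant + non_relevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if non_relevant:
            w -= gamma * sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant)
        if w > 0:                       # keep only positively weighted terms
            new_query[t] = w
    return new_query
```

Terms from relevant documents that were absent from the original query (such as "retrieval" when the user searched only for "information") enter the new query with positive weight, which is exactly the enhancement described above.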
Chapter 4 System Design

4.1 Analysing the current SCS intranet web site

The first task was to investigate the size of the SCS web site. There are about one thousand web pages and documents available on the site. They are hosted on a Unix server of the School of Computer Studies, and the template search system will be installed on the Cslin Linux server.

4.2 The system architecture

[Fig 4.1 System architecture: the user interface and searching modules communicate over the network with the indexing module and the index files.]

Because the web site is not very big, it is not necessary to adopt a three-tier client-server architecture with a dedicated database server to store the index. In this project, the index files are simply stored as text files on the same Cslin Linux server.
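Storing the index as plain text files can be sketched as follows. The record layout shown (one tab-separated `term page weight` line per posting) is an illustrative assumption, not the actual file format of the engine; the project itself is written in Perl, and Python is used here only for brevity.

```python
# Minimal flat-file index store: one "term <tab> page <tab> weight"
# record per line.  The layout is an assumption for illustration only.

def save_index(index, path):
    """index: dict mapping term -> {page: weight}."""
    with open(path, "w", encoding="utf-8") as f:
        for term, postings in sorted(index.items()):
            for page, weight in sorted(postings.items()):
                f.write(f"{term}\t{page}\t{weight}\n")

def load_index(path):
    """Rebuild the term -> {page: weight} mapping from the text file."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, page, weight = line.rstrip("\n").split("\t")
            index.setdefault(term, {})[page] = float(weight)
    return index
```

A flat file like this is trivially transferable between machines, which matters for the crawler/server split described below.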
The web crawler can run on a computer separate from the one providing search services, through a separate interface. All indexing records are written to a self-contained file that can be transferred from a search workstation up to the search server.

4.3 Building the index file and ranking hits

[Fig 4.2 The keyword weighting system: web pages have fluff words removed, the remaining words are stemmed, and keywords are selected and weighted to create or update the index files; a user query, with fluff words removed against the fluff-word list, is then matched and ranked to produce the search results.]

As discussed in Chapter 2, the keyword weighting system gives each keyword a weight according to its term frequency and inverse document frequency.

Ranking system

In the vector model, a document and a user query are represented as t-dimensional vectors. Their correlation can be quantified by the cosine of the angle between the two vectors. When matching the query against documents (web pages), the following formula gives the rank of each retrieved page:

sim(d_j, q) = (d_j . q) / (|d_j| |q|)

The retrieved web pages are sorted by rank, so that the most relevant documents appear first. Search terms found in the title, keywords, or description are given additional weight.

Removing fluff words

Removing fluff words is the first step, and it both saves space and speeds up searches. To save space, a search engine should remove extremely common words such as "the", "web", "a" and "is". Extremely common words are not useful for finding target web pages, so they should not be put in the index file. The index file retains most of the relevance information of the original web pages, and the space saved can be used to index more pages. It is also clear that searches take longer if fluff words are kept in the index file. For example, given the string "the piano player", the search engine has to make three passes to find matches (again, this is oversimplified): first it looks for all matches of "the", then all matches of "piano", then all matches of "player". Chances are, looking for just the last two words is enough to find the relevant pages.

Stemming query and web page keywords

Sometimes a given topic is represented by different forms (e.g. plural or singular) of the same word. As well as plural and singular variation we have the gerund form (making a noun out of a verb, as in index -> indexing) and different tenses (indexed) [13].
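Word-form variation like this can be collapsed by suffix stripping. The project uses Porter's stemming algorithm [15]; the Python sketch below implements only a few illustrative rules, not the full algorithm with its ordered rule steps and measure conditions:

```python
# Simplified suffix stripping in the spirit of Porter's algorithm.
# Only a handful of rules are shown; the real algorithm applies several
# ordered steps with conditions on the length/measure of the stem.

SUFFIX_RULES = [
    ("ational", "ate"),   # relational -> relate
    ("ing", ""),          # indexing   -> index
    ("ed", ""),           # indexed    -> index
    ("ies", "i"),         # ponies     -> poni
    ("s", ""),            # cats       -> cat
]

def stem(word):
    """Apply the first matching rule, keeping at least a 3-letter stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

Under these rules, "index", "indexing" and "indexed" all map to the same index term, which is the effect the next paragraphs rely on.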
Removing suffixes by automatic means is an operation that is especially useful in the field of information retrieval. In the web IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings; thus, stemming reduces the number of terms in the index file. The reduction in storage achieved by stemming is certainly an important consideration for large collections of web documents. Moreover, stemming can improve the recall of information retrieval. Therefore a good stemming algorithm must be used in the search engine.

4.4 User Interface Design

Following the investigation in Chapter 3, the ideal user interface for the search engine should support free-format query input. For this purpose, the search form provides various search types, and users can choose whichever is convenient from the interface form. At the early stages of the project, I hoped the SCS search engine would support a catalogue overview as well, so that users could view the whole structure of the local site, find their target directory quickly, and reduce the scope of the search.

User relevance feedback was also taken into consideration. With relevance feedback, a new query is used to modify the original query, and a set of returned documents replaces the original document collection in order to reduce the search scope. However, the SCS web site is not a big site, so it is not really necessary or useful to implement relevance feedback in the SCS search engine.
[Fig 4.3 Hierarchical category overview: a tree rooted at pages such as new_student_information.html, with branches such as /MSc (is.html, dms_msc.html), /staff, /research and /campus.]

Through the overview tree, users can click on an area of interest to reduce the scope of the search. This tool allows the user to move in and out of areas of interest within the collection: the user can zoom from the overview down to any area that looks promising, and quickly move back and forth while preserving his or her sense of position in the collection. Unfortunately, it is difficult to design a good interface that integrates category selection into query specification, because displaying category hierarchies takes up large amounts of screen space. Another difficulty is how to classify the hundreds of web pages correctly, and how to let web administrators add new web pages to the right category when they update the web site: in many cases, the same web page can belong to several categories. When the design was put into implementation, these difficulties could not be overcome in a short time, so the category overview was dropped from the final implementation.

Three search types

Three search types are provided to users; the engine can handle simple, complex and string queries.

All Terms Searching - every term in the search string must be matched; for example, a search for "Information retrieval" will return primarily web pages containing both "Information" and "retrieval".

Any Term Searching - the search string is matched by individual word; for example, a search for "Information retrieval" will return documents containing either "Information" or "retrieval".

Phrase Searching - all terms must occur next to each other and in order.

4.5 Administrator Interface Design

Web site administrators can index or update web pages with the web crawler through a web interface. To access the administration page, a password is required in order to protect the database. The administrator can also reset the password through the web browser.
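The three query semantics above can be sketched as predicates over the terms indexed for one page. This is a Python illustration over in-memory lists; the real engine evaluates the same logic against its index files.

```python
# The three search types as predicates over one page's indexed terms.
# page_terms preserves the original word order, which phrase search needs.

def matches_all(query_terms, page_terms):
    """All Terms: every query term must occur somewhere in the page."""
    return all(t in page_terms for t in query_terms)

def matches_any(query_terms, page_terms):
    """Any Term: at least one query term must occur in the page."""
    return any(t in page_terms for t in query_terms)

def matches_phrase(query_terms, page_terms):
    """Phrase: the query terms must occur adjacently, in order."""
    n = len(query_terms)
    return any(page_terms[i:i + n] == query_terms
               for i in range(len(page_terms) - n + 1))
```

Note that phrase matching is the only one of the three that depends on word order, which is why the index must keep terms in their original sequence.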
Chapter 5 Implementation

5.1 Choosing a Programming Language

Perl is a scripting language: it is not compiled to object binaries like C or Pascal. It has its own syntax and libraries, and communicates with web servers using the CGI standard. It can be used on most platforms and with most web servers. Moreover, Perl's text-processing functions are very powerful, and the language is easy to learn and use. For these reasons, Perl was chosen as the programming language for building the search engine.

5.2 Finding an Appropriate Template Search Tool

Owing to the limited time available, some free template search tools were used in developing the SCS search engine. Price, platform, capacity, ease of installation, and maintenance were the factors taken into consideration when selecting a template. After comparing several search tools, the Fluid Dynamics Search Engine was chosen as the starting template for the SCS search engine. It is programmed in Perl, and the script runs well on Unix and Windows platforms or in any other Perl CGI environment. The script retrieves the pages of the web site; each page's text, keywords, description, title, and address are extracted and used for searching, and these contents are saved in a separate text file (index file I). During a search, the script scans the index file to find all links related to the user's query, and the retrieved web pages are sorted by the number of keyword hits. The title, description, size, last-modified time, and address of each document are shown to the user in the list of hits, and the administrator can configure the number of hits to show per page.

5.3 Adding a Stemming Algorithm

The template search engine does not stem terms. In the SCS search engine, Porter's stemming algorithm [15] was used to implement this function. The code was obtained from a web site and, after some alteration, was incorporated into the SCS search engine as a function.

5.4 Removing More Fluff Words

The template scripts include a list of fluff words; however, the SCS search engine should remove more fluff words specific to this collection. The additional fluff words come from the MSc IS14 module handouts [19].

5.5 Improving the User Interface

In the template, users can search with two types: Any Term and All Terms. In the SCS search engine, the user interface was altered and a Phrase search type was added to the search form, with the search script edited accordingly. The search terms are stored in the index file in the order in which they appear in the original web pages, so when the user chooses this search type, the query terms must match the whole phrase in the right order.
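Keeping the terms of each page in their original order makes phrase matching a consecutive-position check. A Python sketch using a per-page positional index (the data layout is an assumption for illustration; the engine's actual index format differs):

```python
# Phrase matching over a positional index: for each term of one page we
# keep the positions where it occurs, and a phrase matches when the
# query terms occur at consecutive positions.

def positional_index(page_terms):
    """Map each term to the sorted list of positions at which it occurs."""
    index = {}
    for pos, term in enumerate(page_terms):
        index.setdefault(term, []).append(pos)
    return index

def phrase_match(query_terms, index):
    """True if the query terms appear adjacently, in order, in the page."""
    if not query_terms or query_terms[0] not in index:
        return False
    for start in index[query_terms[0]]:
        if all(start + i in index.get(t, [])
               for i, t in enumerate(query_terms)):
            return True
    return False
```

Each candidate starting position of the first query term is extended one position at a time; the phrase matches only if every later term sits exactly one slot further on.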
[Fig 5.1 Search form] [Fig 5.3 Results list]

For the reasons mentioned in Chapter 4, the implementation of category overviews was abandoned.
5.6 Using the Vector Model

The binary model is used in the original template search engine. In the final SCS search engine, a vector model was added. To compare the effect of the two models on the performance of the search engine, the binary model was kept as well, and the user can choose between them on the search form.

The term weighting system was programmed in Perl. Following the theory in Chapter 2, each term is given a weight according to its term frequency and inverse document frequency at indexing time. A separate Perl script was written to implement this task.

[Fig 5.2 The weighting program reads the web-page key numbers from index file I, writes each page's position to a position file, and writes the term weights of the pages to index file II.]

The term weights are stored in another separate text file (index file II). To save CPU time during real-time searching, a position file is created to record the position of each web page within index file II; through the position file, the search engine can obtain the term weights of any document (web page) directly.

Term stemming affects the weights in some ways. For example, through stemming, the four terms computer, compute, computing and computability are all replaced by comput, so their frequencies are summed under the single term comput. In this way, web pages containing these terms have a better chance of matching queries on topics related to computers.

Retrieved documents are sorted by ranking score, which depends on the query term weights and the document term weights. To make queries look more natural to the user, no format restrictions are placed on the input: users can include symbols in their queries, such as asterisks, quotes, and brackets, without affecting the search results. After fluff words are removed, the query terms are treated equally, and the weights of the document terms are used to compute the ranks of the retrieved web pages. The ranking function was also programmed in Perl and incorporated into the SCS search engine.
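The weighting and ranking just described, tf-idf weights computed at indexing time and cosine similarity at query time as in the formula of section 4.3, can be sketched in Python (the project's own implementation is in Perl and stores the weights in index file II):

```python
import math

# tf-idf weighting at indexing time and cosine ranking at query time,
# as in the vector model described above.  A sketch, not the actual code.

def tfidf_weights(docs):
    """docs: dict page -> list of (stemmed, fluff-free) terms.
    Returns dict page -> {term: weight}."""
    n = len(docs)
    df = {}                                  # document frequency per term
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for page, terms in docs.items():
        tf = {}
        for t in terms:
            tf[t] = tf.get(t, 0) + 1         # raw term frequency
        weights[page] = {t: f * math.log(n / df[t]) for t, f in tf.items()}
    return weights

def cosine(d, q):
    """sim(d, q) = (d . q) / (|d| |q|), both given as term -> weight dicts."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0
```

A term that occurs in every page gets idf log(n/n) = 0 and so contributes nothing to the ranking, which is the same intuition behind removing fluff words before indexing.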
Chapter 6 Testing, Evaluation and Conclusions

6.1 Testing Environment

The SCS search engine was tested under the following environmental conditions; the configurations used were those available to the author.

Server software
- Windows 98 with Perl 5 (running locally)
- Linux server (running at ...)

Browser routines
- Windows 98, 56k modem link to the remote server and using the local server, with Internet Explorer 4.0
- Windows NT 4.0, networked, with Internet Explorer 5.0 and Navigator

6.2 Evaluation Criteria

First of all, as discussed in the first chapter, recall and precision are two of the most important measures for evaluating an information retrieval system. Recall, the proportion of relevant documents retrieved, is not considered a viable measure for Internet search engines, because it is impossible to determine how many relevant items there are for a particular query. However, recall may be a viable measure for intranet search engines: in the relatively confined intranet environment, it may be possible to identify with reasonable certainty the documents relevant to a particular query.
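Given a query's retrieved set and a hand-identified relevant set (feasible on a small intranet, as argued above), the two measures are computed as:

```python
# Precision and recall for one query, given the set of retrieved pages
# and the hand-identified set of relevant pages.

def precision(retrieved, relevant):
    """Fraction of the retrieved pages that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant pages that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)
```

For example, a query that retrieves four pages of which two are relevant, out of three relevant pages in the collection, scores precision 0.5 and recall 2/3.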
Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationCPS122 Lecture: From Python to Java last revised January 4, Objectives:
Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationIntelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD
World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical
More information10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues
COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization
More informationSOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2)
SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay Lecture #10 Process Modelling DFD, Function Decomp (Part 2) Let us continue with the data modeling topic. So far we have seen
More informationModern Information Retrieval
Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,
More informationInformation Retrieval. Chap 7. Text Operations
Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing
More informationReference By Any Other Name
Boise State University ScholarWorks Library Faculty Publications and Presentations The Albertsons Library 4-1-2010 Reference By Any Other Name Ellie Dworak Boise State University This document was originally
More informationActive Server Pages Architecture
Active Server Pages Architecture Li Yi South Bank University Contents 1. Introduction... 2 1.1 Host-based databases... 2 1.2 Client/server databases... 2 1.3 Web databases... 3 2. Active Server Pages...
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationSkill Area 209: Use Internet Technology. Software Application (SWA)
Skill Area 209: Use Internet Technology Software Application (SWA) Skill Area 209.1 Use Browser for Research (10hrs) 209.1.1 Familiarise with the Environment of Selected Browser Internet Technology The
More informationProvided by TryEngineering.org -
Provided by TryEngineering.org - Lesson Focus Lesson focuses on exploring how the development of search engines has revolutionized Internet. Students work in teams to understand the technology behind search
More informationInternational ejournals
Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT
More informationInformation Retrieval and Data Mining Part 1 Information Retrieval
Information Retrieval and Data Mining Part 1 Information Retrieval 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Information Retrieval - 1 1 Today's Question 1. Information
More informationAdaptable and Adaptive Web Information Systems. Lecture 1: Introduction
Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October
More informationChapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction
CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle
More informationCABI Training Materials. Ovid Silver Platter (SP) platform. Simple Searching of CAB Abstracts and Global Health KNOWLEDGE FOR LIFE.
CABI Training Materials Ovid Silver Platter (SP) platform Simple Searching of CAB Abstracts and Global Health www.cabi.org KNOWLEDGE FOR LIFE Contents The OvidSP Database Selection Screen... 3 The Ovid
More informationWWW and Web Browser. 6.1 Objectives In this chapter we will learn about:
WWW and Web Browser 6.0 Introduction WWW stands for World Wide Web. WWW is a collection of interlinked hypertext pages on the Internet. Hypertext is text that references some other information that can
More informationDue on: May 12, Team Members: Arpan Bhattacharya. Collin Breslin. Thkeya Smith. INFO (Spring 2013): Human-Computer Interaction
Week 6 Assignment: Heuristic Evaluation of Due on: May 12 2013 Team Members: Arpan Bhattacharya Collin Breslin Thkeya Smith INFO 608-902 (Spring 2013): Human-Computer Interaction Group 1 HE Process Overview
More informationVision Document for Multi-Agent Research Tool (MART)
Vision Document for Multi-Agent Research Tool (MART) Version 2.0 Submitted in partial fulfillment of the requirements for the degree MSE Madhukar Kumar CIS 895 MSE Project Kansas State University 1 1.
More informationSemantic Website Clustering
Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic
More informationUSER SEARCH INTERFACES. Design and Application
USER SEARCH INTERFACES Design and Application KEEP IT SIMPLE Search is a means towards some other end, rather than a goal in itself. Search is a mentally intensive task. Task Example: You have a friend
More informationA Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,
More informationevision Review Project - Engagement Simon McLean, Head of Web & IT Support Information & Data Services.
evision Review Project - Engagement Monitoring Simon McLean, Head of Web & IT Support Information & Data Services. What is Usability? Why Bother? Types of usability testing Usability Testing in evision
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search IR models: Boolean model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Browsing boolean vector probabilistic
More informationCHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION
CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant
More informationCreating a Course Web Site
Creating a Course Web Site What you will do: Use Web templates Use shared borders for navigation Apply themes As an educator or administrator, you are always looking for new and exciting ways to communicate
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationWeb-interface for Monte-Carlo event generators
Web-interface for Monte-Carlo event generators Jonathan Blender Applied and Engineering Physics, Cornell University, Under Professor K. Matchev and Doctoral Candidate R.C. Group Sponsored by the University
More informationLecture #3: PageRank Algorithm The Mathematics of Google Search
Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,
More informationInfluence of Word Normalization on Text Classification
Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we
More informationCS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationINFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects
INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model Peter Brusilovsky http://www2.sis.pitt.edu/~peterb/2140-051/ Final Group Projects Groups of variable
More informationPredictive Coding. A Low Nerd Factor Overview. kpmg.ch/forensic
Predictive Coding A Low Nerd Factor Overview kpmg.ch/forensic Background and Utility Predictive coding is a word we hear more and more often in the field of E-Discovery. The technology is said to increase
More informationDRACULA. CSM Turner Connor Taylor, Trevor Worth June 18th, 2015
DRACULA CSM Turner Connor Taylor, Trevor Worth June 18th, 2015 Acknowledgments Support for this work was provided by the National Science Foundation Award No. CMMI-1304383 and CMMI-1234859. Any opinions,
More informationCSI Lab 02. Tuesday, January 21st
CSI Lab 02 Tuesday, January 21st Objectives: Explore some basic functionality of python Introduction Last week we talked about the fact that a computer is, among other things, a tool to perform high speed
More information