Implementation of the common phrase index method on the phrase query for information retrieval


Triyah Fatmawati, Badrus Zaman, and Indah Werdiningsih. Citation: AIP Conference Proceedings 1867, (2017). Published by the American Institute of Physics.

Triyah Fatmawati, Badrus Zaman, Indah Werdiningsih
Information System Study Program, Faculty of Science and Technology, Airlangga University, Surabaya, Indonesia

Abstract. With the development of technology, finding information in news text has become easy, because news is distributed not only in print media such as newspapers but also in electronic media that can be accessed through a search engine. When searching for relevant documents with a search engine, users often enter a phrase as the query. The number of words in the phrase query and their positions affect the relevance of the documents returned, and therefore the accuracy of the information obtained. The purpose of this research was to analyze the implementation of the common phrase index method in information retrieval. The research was conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced. The system is built in four stages: pre-processing, indexing, term weighting calculation, and cosine similarity calculation. The system then displays the search results in order of cosine similarity. System testing was conducted using 100 documents and 20 queries, and the results were used in a two-step evaluation. First, the relevant documents were determined using the kappa statistic. Second, the system success rate was determined using precision, recall, and F-measure. The kappa statistic was 0.71, so the relevant documents are eligible for the system evaluation.
The calculation of precision, recall, and F-measure produces a precision of 0.37, a recall of 0.50, and an F-measure of [...]. From this result it can be said that the success rate of the system in producing relevant documents is low.

INTRODUCTION

According to Oxford Dictionaries, a phrase is a small group of words standing together as a conceptual unit, typically forming a component of a clause. The arrangement of the words in a phrase can change its meaning (Mao and Chu, 2006), which affects the accuracy of the information obtained. Phrases are used in ordinary sentences, including news text, and news is a source of information about current events. With the development of technology, finding information in news text has become easy, because news is distributed not only in print media such as newspapers but also in electronic media that can be accessed through a search engine.

When searching for relevant documents with a search engine, users often enter a phrase as the query. A phrase query can consist of two or more words. According to Patterson et al. (2008), a rise in the complexity of the query may increase the possibility of error, which in turn affects the accuracy of the information obtained. The position of the words in the query must also be considered, because a different word order can give a different meaning, so the information obtained from the search will also differ. A method is therefore needed to handle phrase queries, namely indexing. Indexing is the main process in an information retrieval system (Mao and Chu, 2006).

International Conference on Mathematics: Pure, Applied and Computation. AIP Conf. Proc. 1867. Published by AIP Publishing.

Indexing can be done by several methods, including the inverted index, the auxiliary nextword index, and the common phrase index (Manning et al., 2009; Bahle et al., 2002; Chang and Poon, 2007). The inverted index, or inverted file, is a basic concept in information retrieval (Manning et al., 2009). In its implementation, each sentence in a document is broken down into terms. Its advantage is fast searching through an enormous number of documents (Chang and Poon, 2007). For phrase querying, however, it is not simple to predict which occurrences of a term will appear in a query phrase, so reordering is unlikely to be effective (Bahle et al., 2002). The auxiliary nextword index reduces cost and the number of disk accesses, which increases efficiency, but it works well only for phrase queries of size two (Chang and Poon, 2007). The common phrase index is an alternative for phrase query optimization (Chang and Poon, 2007). Its advantage is a tree-like vocabulary structure, so it can be used for queries of more than two words. Given the advantages of each method, the method used in this research is the common phrase index.

According to Chang and Poon (2007), the words in the common phrase index method are grouped into common words and rare words. Common words are words with a high frequency; rare words are words with a low frequency. Unlike the auxiliary nextword index, the common phrase index has a tree-like structure that runs from the root to the leaves. Each path in the structure represents a phrase that starts with a common word and ends with a terminal word.

Based on the problem outlined above, the purpose of this research is to analyze the implementation of the common phrase index method in information retrieval.
The research is conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced.

RESEARCH METHODS

The research on the implementation of the common phrase index method on the phrase query for information retrieval is carried out in four main steps: pre-processing, indexing, term weighting, and cosine similarity. The steps are shown in Figure 1.

FIGURE 1. Research Method

Pre-Processing

The pre-processing stage is performed every time a document or query is entered into the system. It consists of two steps, tokenizing and stemming. Tokenizing breaks a sentence into words. Stemming maps the various forms of a word, whether formed with a prefix, a suffix, or both (a confix), to its base word (stem) (Zaman and Winarko, 2011). For English text, the Porter Stemmer algorithm is used.

Indexing

Before indexing, the common words are determined. The frequency of occurrence of each term across all documents is counted, and the frequencies are summed. The total is then used to set the threshold for common words: the probability of occurrence of each term and the cumulative probability are calculated, and in this research the common words are limited to a cumulative total of 75%.

At the indexing stage, every word produced by the pre-processing stage is matched against the common word list. If a word is in the list, it is merged with the adjacent word. Merging stops when a terminal word is reached, that is, a verb, noun, or adjective. If a word is not in the common word list, it is not merged.

Term Weighting Calculation

Term weighting is calculated for each term in each document and query using equation 1 (Manning et al., 2009).

\[ \text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t \qquad (1) \]

The tf value is the frequency of occurrence of a term (an indexing result) in a document or query. The idf value is calculated using equation 2, where df is the number of documents that contain the term.
\[ \text{idf}_t = \log \frac{N}{\text{df}_t} \qquad (2) \]

Here N is the total number of documents in the collection.

Cosine Similarity Calculation

Cosine similarity is calculated using equation 3 (Manning et al., 2009). It measures the similarity between the query and each document in the stored corpus.

\[ \text{score}(q, d) = \frac{\vec{v}(q) \cdot \vec{v}(d)}{|\vec{v}(q)| \, |\vec{v}(d)|} \qquad (3) \]

RESULTS AND DISCUSSION

Data Collection

The data are English news articles about information technology, with a simple structure, taken from the online news site of The Jakarta Post. The collection consists of 100 documents downloaded in December [...]. These documents are used to test the system in order to determine the level of relevance of the search results. Figure 2 shows an example document.

FIGURE 2. Examples of Documents

Pre-Processing

In the pre-processing phase, the steps taken are tokenizing and stemming. Tokenizing splits each document and query into words: the system converts uppercase to lowercase, removes punctuation characters, and splits the text at the spaces in the document or query. In the stemming process, the system removes prefixes and suffixes using the Porter Stemmer algorithm.

Indexing

The indexing process is carried out using the common phrase index method. The indexing algorithm is shown in Figure 3. In practice, the method requires a common word list and a terminal word list. The common word list can change every time a new document is entered into the system. The terminal word list does not change and contains only verbs, nouns, and adjectives.

FIGURE 3. Indexing Algorithm Using Common Phrase Index
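The indexing steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `common_words` and `phrase_terms` are hypothetical, the 75% limit is read as a cap on the cumulative share of term occurrences (most frequent terms first), and a term array is started at every word position, which is one reading consistent with the query term arrays [googl, map], [map, applic], [applic] listed in Table 3. The common and terminal word sets in the usage lines are illustrative only.

```python
from collections import Counter


def common_words(docs, cutoff=0.75):
    """Pick the most frequent terms whose cumulative share of all term
    occurrences stays within `cutoff` (assumed reading of the 75% limit)."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    common, running = set(), 0
    for term, freq in counts.most_common():
        running += freq
        if running > cutoff * total:
            break
        common.add(term)
    return common


def phrase_terms(tokens, common, terminal):
    """Build one term array per word position.

    A common word is merged with the words that follow it until a
    terminal word is reached; any other word forms a single-word array.
    """
    terms = []
    for i, word in enumerate(tokens):
        term = [word]
        if word in common:
            for nxt in tokens[i + 1:]:
                term.append(nxt)
                if nxt in terminal:  # stop merging at the terminal word
                    break
        terms.append(tuple(term))
    return terms


# Illustrative data only: stemmed tokens of the query "Google maps application".
query = ["googl", "map", "applic"]
terms = phrase_terms(query, common={"googl", "map"}, terminal={"map", "applic"})
print(terms)
```

Representing each term array as a tuple keeps it hashable, so it can serve directly as a dictionary key when frequencies are counted in the term weighting stage.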

In the indexing process, every word produced by the pre-processing stage is matched against the common word list. If a word is in the list, it is merged with the adjacent word. Merging stops at a terminal word, which yields a term array of two or more words. If a word is not in the common word list, it is not merged and forms a single-word term array.

Term Weighting Calculation

In general, the term weighting process is divided into four stages: calculation of term frequency, document frequency, inverse document frequency, and term weighting.

The first stage calculates the term frequency (tf) of each term array produced by indexing, for both documents and queries. Each time a document or query is entered, the system calculates the tf of all of its term arrays; for the next document or query, the tf calculation starts again from the beginning.

The second stage calculates the document frequency (df) of each term array. When a new document is entered, the system increments the df of every term array in it that also occurs in previously entered documents.

The third stage calculates the inverse document frequency (idf) using equation 2. An example calculation of df and idf for the sample document is shown in Table 1.

The fourth stage calculates the term weight (tf-idf) of each term array in each document and query using equation 1. Example tf-idf results for a document are shown in Table 2, and example results for a query are shown in Table 3.
TABLE 1. Calculation Results of Df and Idf
Term Array | Df | Idf
[microsoft, sai] | 1 | 2
[now, run] | 1 | 2
[store] | 1 | 2
[reaction] | 1 | 2

TABLE 2. Example of Term Weighting Calculation for Document
Term Phrase ID | Doc ID | Term Array | Tf | Idf | Tf-Idf
TermPhrase1 | Dok5 | [microsoft, sai] | [...] | [...] | [...]
TermPhrase10 | Dok5 | [now, run] | [...] | [...] | [...]
TermPhrase100 | Dok5 | [store] | [...] | [...] | [...]
TermPhrase101 | Dok5 | [reaction] | [...] | [...] | [...]

TABLE 3. Example of Term Weighting Calculation for Query
Term Phrase ID | Query ID | Term Array | Tf | Idf | Tf-Idf
TermPhrase148 | Query1 | [googl, map] | [...] | [...] | [...]
TermPhrase149 | Query1 | [map, applic] | [...] | [...] | [...]
TermPhrase150 | Query1 | [applic] | [...] | [...] | [...]
TermPhrase663 | Query10 | [yahoo, ] | [...] | [...] | [...]

Cosine Similarity Calculation

At this stage, the cosine similarity between each query and each document is calculated using equation 3. Table 4 shows an example of the cosine similarity calculation for query 1.

TABLE 4. Calculation Result of Cosine Similarity for Query 1
No. | Cosine Similarity ID | Doc ID | Query ID | Value of Cosine Similarity
1. | Cosine1 | Dok1 | Query 1 | [...]
2. | Cosine2 | Dok2 | Query 1 | [...]
3. | Cosine3 | Dok3 | Query 1 | [...]
4. | Cosine4 | Dok4 | Query 1 | [...]
5. | Cosine5 | Dok5 | Query 1 | [...]
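The term weighting and cosine similarity stages (equations 1 to 3) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the logarithm is taken to base 10 (consistent with Table 1, where df = 1 gives idf = 2 for a 100-document collection), and the query is assumed to be weighted with the collection idf, with query term arrays that never occur in the collection dropped. The three toy documents are made up for the example.

```python
import math
from collections import Counter


def build_index(docs):
    """docs: one list of term arrays (tuples) per document.
    Returns (tf-idf vector per document, idf per term array)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log10(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def weight_query(query_terms, idf):
    """Weight the query with the collection idf (assumption);
    term arrays unseen in the collection are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(query_terms).items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def search(query_terms, vectors, idf):
    """Return (document index, score) pairs with score > 0, highest first."""
    q = weight_query(query_terms, idf)
    scored = [(i, cosine(q, d)) for i, d in enumerate(vectors)]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)


# Toy collection of three documents, already indexed into term arrays.
docs = [
    [("googl", "map"), ("applic",)],
    [("store",), ("applic",)],
    [("reaction",)],
]
vectors, idf = build_index(docs)
hits = search([("googl", "map"), ("applic",)], vectors, idf)
print(hits)  # document 0 ranks first, document 2 is not returned
```

Storing each vector as a dictionary keeps only the non-zero weights, which suits a collection where most term arrays occur in very few documents.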

System Testing

The testing stage used 100 documents and 20 queries, in two scenarios. The first scenario enters the 100 .txt documents through the document input feature. The second scenario enters the 20 queries through the search feature. The system then displays the search results in order in a table. Only documents with a cosine similarity greater than 0 are displayed. All queries used in this stage and the documents returned by the system are shown in Table 5.

TABLE 5. Query for Testing and Document Search Results of System
Query ID | Query | Document ID of Search Results | Amount
Query 1 | Google maps application | 3, 51, 40, 65 | 4
Query 2 | Marshmallow android mobile device | 21, 2, 19, 45, 8, 43, 5, 11, 13, 37, 88, 34, 90, 36, 72, 14, 71, 29, 7, 39, 31, 15, 83, 38, 58, 80, 17, 42, 28, 73, 86, 99, 82 | 33
Query 3 | Microsoft's Windows 10 | 78, 81 | 2
Query 4 | New Google's logo | 33, 54 | 2
Query 5 | Smartphone for selfie lovers | 5, 86, 42, 16, 52, 75 | 6
Query 6 | Smartphone 4.5G | 60 | 1
Query 7 | Online streaming music | 61 | 1
Query 8 | New smartwatch | - | 0
Query 9 | Payroll applications | 100, 74 | 2
Query 10 | Yahoo's | - | 0
Query 11 | Dislike button on Facebook | 25, 41, 63, 9, 12, 5, 14, 8, 15, [...] | [...]
Query 12 | New AADC sticker | 96, 63 | 2
Query 13 | Mobile device innovation | 73, 45, 43, 36, 29, 39, 31, 83, 38, 58, 80, 42, 28, 86, 72, 99, 82 | 17
Query 14 | New features in IPhone 6 | 2, 26, 29, 23, 35, 83, 63, 24, 51 | 9
Query 15 | Mobile payments | 88, 54, 34, 69 | 4
Query 16 | Mobile operating system | 73, 36, 37, 23, 71, 80, 72, 78, 26, 83, 54, [...] | [...]
Query 17 | Cyber hacker | 87, 73, 89 | 3
Query 18 | Cloud computing | 27, 81 | 2
Query 19 | Smartphone market | 44, 59, 30, 45, 23, 28, 73 | 7
Query 20 | Messenger apps | 43, 63, 8 | 3

System Evaluation

The system evaluation is divided into two stages: determining the relevant documents and calculating the success rate of the search system.
Determination of Relevant Documents

The relevant documents are determined by calculating the kappa statistic over the judgments of three judges. Obtaining the kappa statistic takes three steps: determining the relevant documents, creating the relevance table, and calculating the kappa statistic using equation 4. According to Manning et al. (2009), as a rule of thumb, a kappa value above 0.8 is taken as good agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement below 0.67 is seen as data providing a dubious basis for an evaluation.

\[ \text{kappa} = \frac{P(A) - P(E)}{1 - P(E)} \qquad (4) \]

Here P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of the times they would be expected to agree by chance.

The calculation gives an average kappa statistic of 0.71, so the relevant documents according to judges 1, 2, and 3 can be used for the evaluation. Table 6 shows the calculated kappa statistic values, and Table 7 shows the relevant documents according to the judges.
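Equation 4 can be illustrated with a short sketch for one query. It assumes that P(E) is estimated from pooled marginals, as in the textbook treatment by Manning et al. (2009); whether the authors computed it this way is not stated. The function names and the judgments in the usage lines are hypothetical.

```python
from itertools import combinations


def kappa(rel_a, rel_b, n_docs):
    """Kappa for two judges, given the sets of document IDs that each
    judge marked relevant out of `n_docs` judged documents."""
    both_yes = len(rel_a & rel_b)
    both_no = n_docs - len(rel_a | rel_b)
    p_agree = (both_yes + both_no) / n_docs  # P(A)
    # Pooled marginals: chance that a random judgment is "relevant".
    p_rel = (len(rel_a) + len(rel_b)) / (2 * n_docs)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2  # P(E)
    if p_chance == 1:
        return float("nan")  # undefined when all judgments are identical
    return (p_agree - p_chance) / (1 - p_chance)


def mean_pairwise_kappa(judgments, n_docs):
    """Average kappa over all pairs of judges."""
    pairs = list(combinations(judgments, 2))
    return sum(kappa(a, b, n_docs) for a, b in pairs) / len(pairs)


# Hypothetical judgments over 400 documents: 300 marked relevant by both
# judges, 20 only by judge 1, 10 only by judge 2, 70 by neither.
judge1 = set(range(320))
judge2 = set(range(300)) | set(range(320, 330))
print(round(kappa(judge1, judge2, 400), 3))  # 0.776
```

With three judges, `mean_pairwise_kappa` averages the three pairwise values, which mirrors the pairwise columns and the average column of Table 6.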

TABLE 6. Calculation Results of Kappa Statistic
Query | Judge 1 and Judge 2 | Judge 1 and Judge 3 | Judge 2 and Judge 3 | Average
Average | [...] | [...] | [...] | [...]

TABLE 7. Relevant Documents List
Query | Judge 1 | Judge 2 | Judge 3 | Relevant Documents | Amount
1 1, 3, 40, 58 1, 3, 32, 40, 58 3, 32, 40 1, 3, 32, 40, , 35, 78, 80, 83 35, 78, 80, 82, 83 11, 17, 35, 78, 35, 78, 80, 82, , 82, , 41 33, , , 39 5, , , , 99 61, 99 61, 99 61, , 36 34, 36 34, 36 34, , 88, , , 50, 77 48, 50, , 50, , 25 6, 25 6, 25 6, , 18, 19, 20, 22, 27, 28, 44, 53, 59, 65, 73, 92, 93, 95, 98 17, 18, 19, 20, 22, 27, 28, 65, 73, 92, 95 19, 21, 30, 92, 93, 95 17, 18, 19, 20, 22, 27, 28, 65, 73, 92, 93, , 26 24, 26 24, 26 24, , 88, , 21, 45, 46 11, 45, , , 23, 56, 62, 87, 7, 23, 56, 62, 87, 56, 62, 73, 87, 7, 23, 56, 62, 87, 7 89, 97 89, 97 89, 97 89, , , , 44, 55 21, 30, 44, 55 44, 55 30, 44, , 13, 43, 49 4, 13, 43, , 13,

System Success Rate Calculation

The success rate of the system is measured by precision, recall, and F-measure. Three steps are carried out: comparing the system search results with the relevant documents, as in Table 8; calculating precision and recall using equations 5 and 6 (O'Sullivan et al., 2010); and calculating the F-measure using equation 7 (Büttcher et al., 2010; Manning et al., 2009; Zaman and Winarko, 2011).

\[ \text{precision} = \frac{|R(E) \cap R(A)|}{|R(A)|} \qquad (5) \]

\[ \text{recall} = \frac{|R(E) \cap R(A)|}{|R(E)|} \qquad (6) \]

\[ F = \frac{2PR}{P + R} \qquad (7) \]

where R(E) is the set of relevant documents, R(A) is the set of documents returned by the system, P is precision, and R is recall. The calculated precision, recall, and F-measure for the search results are shown in Table 9.

TABLE 8. Comparison of Relevant Documents to System Search Results
Query | Relevant Documents | Document Search Results | Relevant Documents ∩ Document Search Results | Amount (Relevant Documents ∩ Document Search Results)
1 1, 3, 32, 40, 58 3, 51, 40, 65 3, , 2, 19, 45, 8, 43, 5, 11, 13, 37, 88, 34, 90, 36, 72, 14, 71, 29, 7, 39, 31, 15, 83, 38, 58, 80, 17, 42, 28, 73, 86, 99, , 78, 80, 82, 78, , 41 33, , 39 5,86, 42, 16, 52, , , , , 50, , 25 25, 41, 63, 9, 12, 5, 14, 8, , , , 18, 19, 20, 22, 27, 28, 65, 73, 92, 93, 95 73, 45, 43, 36, 29, 39, 31, 83, 38, 58, 80, 42, 28, 86, 72, 99, 82 28, , 26 2, 26, 29, 23, 35, 83, 63, 24, 24, , 54, 34, , 45 73, 36, 37, 23, 71, 80, 72, , 26, 83, 54, , 23, 56, 62, 87, 73, 89 87, , 89, , , , 44, 55 44, 59, 30, 45, 23, 28, 73 30, , 13, 43 43, 63,
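Equations 5 to 7 can be checked for a single query with a few lines of Python. The sketch below uses the Query 1 data as listed in Tables 5 and 8 (relevant documents 1, 3, 32, 40, 58; search results 3, 51, 40, 65). The function name `evaluate` is hypothetical, and returning 0 when the system returns no documents (as for Queries 8 and 10) is an assumption, since the source does not say how that case was scored.

```python
def evaluate(relevant, retrieved):
    """Return (precision, recall, F-measure) for one query.

    relevant  -- R(E), the set of relevant document IDs
    retrieved -- R(A), the set of document IDs returned by the system
    """
    hits = len(relevant & retrieved)
    # Assumption: a score of 0 when a denominator would be empty.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Query 1: relevant documents and system search results.
p, r, f = evaluate({1, 3, 32, 40, 58}, {3, 51, 40, 65})
print(p, r, round(f, 3))  # 0.5 0.4 0.444
```

Two of the four returned documents (3 and 40) are relevant, so precision is 2/4, recall is 2/5, and the F-measure is 2(0.5)(0.4)/(0.5 + 0.4) = 4/9.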

TABLE 9. Calculation Results of Precision, Recall, and F-measure
Query | Precision | Recall | F-measure
Average | [...] | [...] | [...]

Table 9 shows that the average precision is smaller than the average recall. This is because the system returns more documents than there are relevant documents, while only a small number of the returned documents are relevant. In addition, the average F-measure indicates that the success rate of the search system is low.

CONCLUSION

From the research that has been done, the following conclusions were obtained:

The system displays the resulting documents in order of cosine similarity, from the largest value to the smallest. The documents returned by the system are those with a cosine similarity greater than 0.

Based on the evaluation, the degree of relevance of the documents returned by the search process using the common phrase index is low. This can be seen from the precision, recall, and F-measure, which are 0.37, 0.50, and [...], respectively.

REFERENCES

1. Bahle, D., Williams, H. E., and Zobel, J. (2002). Efficient Phrase Querying with an Auxiliary Index. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
2. Büttcher, S., Clarke, C. L., and Cormack, G. V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. Massachusetts: The MIT Press.
3. Chang, M., and Poon, C. K. (2007). Efficient Phrase Querying with Common Phrase Index. Information Processing and Management 44 (2008).
4. Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge, England: Cambridge University Press.
5. Mao, W., and Chu, W. W. (2006). The Phrase-Based Vector Space Model for Automatic Retrieval of Free-Text Medical Documents. Data & Knowledge Engineering 61 (2007).
6. O'Sullivan, D. M., Wilk, S. A., Michalowski, W. J., and Farion, K. J. (2010). Automatic Indexing and Retrieval of Encounter-Specific Evidence for Point-of-Care Support. Journal of Biomedical Informatics 43 (2010).
7. Patterson, K., Watters, C., and Shepherd, M. (2008). Document Retrieval using Proximity-Based Phrase Searching. Proceedings of the 41st Hawaii International Conference on System Sciences. IEEE.
8. Oxford Dictionaries online. Retrieved on 19 November [...].
9. Zaman, B., and Winarko, E. (2011). Analisis Fitur Kalimat untuk Peringkas Teks Otomatis pada Bahasa Indonesia. IJCCS, Vol. 5 No. 2, July 2011.

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Custom IDF weights for boosting the relevancy of retrieved documents in textual retrieval

Custom IDF weights for boosting the relevancy of retrieved documents in textual retrieval Annals of the University of Craiova, Mathematics and Computer Science Series Volume 44(2), 2017, Pages 238 248 ISSN: 1223-6934 Custom IDF weights for boosting the relevancy of retrieved documents in textual

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

The Effect of Diversity Implementation on Precision in Multicriteria Collaborative Filtering

The Effect of Diversity Implementation on Precision in Multicriteria Collaborative Filtering The Effect of Diversity Implementation on Precision in Multicriteria Collaborative Filtering Wiranto Informatics Department Sebelas Maret University Surakarta, Indonesia Edi Winarko Department of Computer

More information

Adaptive Model of Personalized Searches using Query Expansion and Ant Colony Optimization in the Digital Library

Adaptive Model of Personalized Searches using Query Expansion and Ant Colony Optimization in the Digital Library International Conference on Information Systems for Business Competitiveness (ICISBC 2013) 90 Adaptive Model of Personalized Searches using and Ant Colony Optimization in the Digital Library Wahyu Sulistiyo

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Extracting Summary from Documents Using K-Mean Clustering Algorithm

Extracting Summary from Documents Using K-Mean Clustering Algorithm Extracting Summary from Documents Using K-Mean Clustering Algorithm Manjula.K.S 1, Sarvar Begum 2, D. Venkata Swetha Ramana 3 Student, CSE, RYMEC, Bellary, India 1 Student, CSE, RYMEC, Bellary, India 2

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Document Searching Engine Using Term Similarity Vector Space Model on English and Indonesian Document

Document Searching Engine Using Term Similarity Vector Space Model on English and Indonesian Document Document Searching Engine Using Term Similarity Vector Space Model on English and Indonesian Document Andreas Handojo, Adi Wibowo, Yovita Ria Informatics Engineering Department Faculty of Industrial Technology,

More information
