Implementation of the common phrase index method on the phrase query for information retrieval
Triyah Fatmawati, Badrus Zaman, and Indah Werdiningsih
Citation: AIP Conference Proceedings 1867, (2017)
Published by the American Institute of Physics
Triyah Fatmawati, Badrus Zaman, Indah Werdiningsih
Information System Study Program, Faculty of Science and Technology, Airlangga University, Surabaya, Indonesia

Abstract. With the development of technology, finding information in news text has become easy, because news is distributed not only in print media such as newspapers but also in electronic media that can be accessed through a search engine. When searching for relevant documents with a search engine, a phrase is often used as the query. The number of words that make up the phrase query, and their positions, affect the relevance of the documents returned and therefore the accuracy of the information obtained. The purpose of this research was to analyze the implementation of the common phrase index method in information retrieval. The research was conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced. The system is built in four stages: pre-processing, indexing, term weighting calculation, and cosine similarity calculation. The system then displays the search results ranked by cosine similarity. System testing was conducted using 100 documents and 20 queries, and the results were used in the evaluation stage. First, the relevant documents were determined using the kappa statistic. Second, the system success rate was determined using precision, recall, and F-measure. The kappa statistic was 0.71, so the relevant documents are eligible for the system evaluation.
The evaluation produced a precision of 0.37, a recall of 0.50, and an F-measure of [...]. From this result it can be said that the success rate of the system in producing relevant documents is low.

INTRODUCTION

According to Oxford Dictionaries, a phrase is a small group of words standing together as a conceptual unit, typically forming a component of a clause. The arrangement of the words in a phrase can change its meaning (Mao and Chu, 2006), which affects the accuracy of the information obtained. Phrases are commonly used in sentences, as in news text. News is a source of information about current events. With the development of technology, finding information in news text has become easy, because news is distributed not only in print media such as newspapers but also in electronic media that can be accessed through a search engine. When searching for relevant documents with a search engine, a phrase is often used as the query.

A phrase query entered by a user can consist of two or more words. According to Patterson et al. (2008), rising query complexity increases the possibility of error, which affects the accuracy of the information obtained. In addition, the positions of the words in the query must be considered, because a different word order can carry a different meaning, so the information obtained from the search will also differ. A method is therefore needed to handle phrase queries, namely indexing. Indexing is the main process in an information retrieval system (Mao and Chu, 2006).

International Conference on Mathematics: Pure, Applied and Computation. AIP Conf. Proc. 1867. Published by AIP Publishing.
Indexing can be done with several methods, including the inverted index, the auxiliary nextword index, and the common phrase index (Manning et al., 2009; Bahle et al., 2002; Chang and Poon, 2007). The inverted index, or inverted file, is a basic concept in information retrieval (Manning et al., 2009). In this method, each sentence in a document is broken down into terms. Its advantage is fast searching through an enormous number of documents (Chang and Poon, 2007). For phrase querying, however, it is not simple to predict which occurrences of a term will be in a query phrase, and thus such reordering is unlikely to be effective (Bahle et al., 2002). The auxiliary nextword index reduces cost and the number of disk accesses, which increases efficiency, but it works well only for phrase queries of size two (Chang and Poon, 2007). The common phrase index is an alternative for phrase query optimization (Chang and Poon, 2007). Its advantage is a tree-like vocabulary structure, so it can be used for queries of more than two words. Considering the advantages of each method, this research uses the common phrase index.

According to Chang and Poon (2007), the words in the common phrase index method are grouped into common and rare words. Common words are words with a high frequency; rare words are words with a low frequency. Unlike the auxiliary nextword index, the common phrase index has a tree-like structure that runs from the root to the leaves. Each structure in the common phrase index represents a phrase that starts with a common word and ends with a terminal word. Based on the problem outlined above, the purpose of this research is to analyze the implementation of the common phrase index method in information retrieval.
The research is conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced.

RESEARCH METHODS

The research on the implementation of the common phrase index method on the phrase query for information retrieval consists of four main steps: pre-processing, indexing, term weighting, and cosine similarity. The steps are shown in Figure 1.

FIGURE 1. Research Method
Pre-Processing

The pre-processing stage runs every time a document or query is inserted into the system. It has two steps, tokenizing and stemming. Tokenizing breaks a sentence into words. Stemming maps the various forms of a word, whether formed with a prefix, a suffix, or a combination of both (confix), to its base form (stem) (Zaman and Winarko, 2011). For English text, the Porter Stemmer algorithm is used.

Indexing

Before indexing, the common words must be determined. The frequency of occurrence of each term is counted over all documents, and the frequencies are summed. The total is then used to set the threshold for common words: the probability of occurrence of each term and the cumulative probability are calculated, and in this research the common words are limited to a cumulative total of 75%. At the indexing stage, every word produced by pre-processing is matched against the common word list. If a word is in the common word list, it is merged with the adjacent word. Merging stops when a terminal word is met, that is, a verb, noun, or adjective. If a word is not in the common word list, it is not merged.

Term Weighting Calculation

Term weighting is calculated for each term in each document and query using Equation 1 (Manning et al., 2009).

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t \quad (1)$$

The tf is the frequency of occurrence of a term produced by indexing, in either a document or a query. The idf is calculated using Equation 2, where df is the number of documents that contain the term.
$$\text{idf}_t = \log \frac{N}{\text{df}_t} \quad (2)$$

Here N is the total number of documents in the collection.

Cosine Similarity Calculation

Cosine similarity is calculated using Equation 3 (Manning et al., 2009). It measures the similarity between a query and each document in the stored corpus.

$$\text{score}(q, d) = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|} \quad (3)$$

RESULTS AND DISCUSSION

Data Collection

The data are English news articles about information technology, with a simple structure, taken from the online news site of The Jakarta Post. The collection consists of 100 documents that were downloaded in December. These documents are used to test the system in order to determine the level of relevance of the search results. Figure 2 shows an example document.
FIGURE 2. Examples of Documents

Pre-Processing

In the pre-processing phase, the steps taken are tokenizing and stemming. Tokenizing splits each document and query into words. During tokenizing, the system converts uppercase to lowercase, removes punctuation characters, and uses the spaces in the document or query as split points. During stemming, the system removes prefixes and suffixes using the Porter Stemmer algorithm.

Indexing

The indexing process is carried out using the common phrase index method. The indexing algorithm is shown in Figure 3. In practice, this method requires a common word list and a terminal word list. The common word list can change every time a new document is inserted into the system. The terminal word list does not change and contains only verbs, nouns, and adjectives.

FIGURE 3. Indexing Algorithm Using Common Phrase Index
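The common-word selection and the merging step described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_common_words` and `build_term_arrays` are hypothetical, the 75% limit is read as "keep the most frequent terms while their cumulative share of all occurrences does not exceed 0.75", and one term array is started at every token, which is the reading suggested by the overlapping entries for Query 1 in Table 3. Stemming is omitted, so the tokens are assumed to be stemmed already.

```python
from collections import Counter


def select_common_words(term_counts, cumulative_limit=0.75):
    """Keep the most frequent terms while their cumulative share stays within the limit."""
    total = sum(term_counts.values())
    common, running = set(), 0
    # most_common() yields terms from highest to lowest frequency
    for term, count in term_counts.most_common():
        running += count
        if running / total > cumulative_limit:
            break
        common.add(term)
    return common


def build_term_arrays(tokens, common_words, terminal_words):
    """Start one term array at every token; extend it only if the token is common."""
    term_arrays = []
    for i, word in enumerate(tokens):
        phrase = [word]
        if word in common_words:
            # Merge the following words until a terminal word has been appended
            for following in tokens[i + 1:]:
                phrase.append(following)
                if following in terminal_words:
                    break
        term_arrays.append(tuple(phrase))
    return term_arrays


# Hypothetical word lists, chosen so the result mirrors Query 1 in Table 3
tokens = ["googl", "map", "applic"]
print(build_term_arrays(tokens, {"googl", "map"}, {"map", "applic"}))
```

With these hypothetical lists the call produces the three term arrays [googl, map], [map, applic], and [applic]. A word that is not in the common word list always yields a single-word term array.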
In the indexing process, every word produced by the pre-processing stage is matched against the common word list. If a word is in the list of common words, it is merged with the adjacent word. Merging stops when a terminal word is met, which produces a term array composed of two or more words. If a word is not in the list of common words, it is not merged, so it forms a term array with a single term.

Term Weighting Calculation

In general, term weighting is calculated in four stages: term frequency, document frequency, inverse document frequency, and term weighting. The first stage calculates the term frequency (tf) of each term array produced by indexing, for both documents and queries. Each time a document or query is entered, the system calculates the tf of all term arrays in that document or query. If the system receives another document or query, the tf calculation for each term array starts over. The second stage calculates the document frequency (df) of each term array. Whenever a new document is entered, the system increments the df of every term array in that document that also occurs in previous documents. The third stage calculates the inverse document frequency (idf) using Equation 2. An example of the df and idf calculation for the sample document is shown in Table 1. The fourth stage calculates the term weight (tf-idf) of each term array in each document and query using Equation 1. Example tf-idf results for a document are shown in Table 2, and example tf-idf results for queries are shown in Table 3.
TABLE 1 Calculation Results of Df and Idf
Term Array | Df | Idf
[microsoft, sai] | 1 | 2
[now, run] | 1 | 2
[store] | 1 | 2
[reaction] | 1 | 2

TABLE 2 Example of Term Weighting Calculation for Document
Term Phrase ID | Doc ID | Term Array | Tf | Idf | Tf-Idf
TermPhrase1 | Dok5 | [microsoft, sai] | | |
TermPhrase10 | Dok5 | [now, run] | | |
TermPhrase100 | Dok5 | [store] | | |
TermPhrase101 | Dok5 | [reaction] | | |

TABLE 3 Example of Term Weighting Calculation for Query
Term Phrase ID | Query ID | Term Array | Tf | Idf | Tf-Idf
TermPhrase148 | Query1 | [googl, map] | | |
TermPhrase149 | Query1 | [map, applic] | | |
TermPhrase150 | Query1 | [applic] | | |
TermPhrase663 | Query10 | [yahoo, ] | | |

Cosine Similarity Calculation

At this stage, the cosine similarity between each query and each document is calculated using Equation 3. Table 4 is an example of the cosine similarity calculation for Query 1.

TABLE 4 Calculation Result of Cosine Similarity for Query 1
No. | Cosine Similarity ID | Doc ID | Query ID | Value of Cosine Similarity
1 | Cosine1 | Dok1 | Query1 |
2 | Cosine2 | Dok2 | Query1 |
3 | Cosine3 | Dok3 | Query1 |
4 | Cosine4 | Dok4 | Query1 |
5 | Cosine5 | Dok5 | Query1 |
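The weighting and similarity calculations of Equations 1 to 3 can be sketched in Python as follows. This is a hedged illustration rather than the system's code: the function names are hypothetical, a base-10 logarithm is assumed because Table 1 reports an idf of 2 for df = 1 in a collection of 100 documents, and term arrays that never occur in the collection are assumed to receive no weight. The `rank_documents` helper reflects the stated behaviour of listing only documents with cosine similarity greater than 0, from the largest value to the smallest.

```python
import math
from collections import Counter


def tfidf_vector(term_arrays, doc_freq, n_docs):
    """Weight each term array by tf * log10(N / df), following Equations 1 and 2."""
    weights = {}
    for term, tf in Counter(term_arrays).items():
        df = doc_freq.get(term, 0)
        if df > 0:  # assumption: terms unseen in the collection get no weight
            weights[term] = tf * math.log10(n_docs / df)
    return weights


def cosine_similarity(q_vec, d_vec):
    """Dot product of the two weight vectors divided by the product of their norms."""
    dot = sum(w * d_vec.get(term, 0.0) for term, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


def rank_documents(q_vec, doc_vecs):
    """Return (doc_id, score) pairs with score > 0, highest score first."""
    scores = {doc_id: cosine_similarity(q_vec, vec) for doc_id, vec in doc_vecs.items()}
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Two term arrays from Table 1, each with df = 1 in a collection of 100 documents
doc_freq = {("store",): 1, ("reaction",): 1}
dok5 = tfidf_vector([("store",), ("reaction",)], doc_freq, 100)
query = tfidf_vector([("store",)], doc_freq, 100)  # hypothetical one-term query
print(dok5[("store",)], rank_documents(query, {"Dok5": dok5}))
```

Storing the vectors as dictionaries keyed by term array keeps them sparse, so only term arrays shared by the query and the document contribute to the dot product.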
System Testing

The testing stage used 100 documents and 20 queries. Two scenarios were carried out. The first scenario entered 100 .txt documents through the document input feature. The second scenario entered 20 queries through the search feature. The system then displays the documents found in ranked order in a table. However, the system displays only documents with a cosine similarity value greater than 0. All queries used in this stage and the documents returned by the system are shown in Table 5.

TABLE 5 Query for Testing and Document Search Results of System
Query ID | Query | Document ID of Search Results | Amount
Query 1 | Google maps application | 3, 51, 40, 65 | 4
Query 2 | Marshmallow android mobile device | 21, 2, 19, 45, 8, 43, 5, 11, 13, 37, 88, 34, 90, 36, 72, 14, 71, 29, 7, 39, 31, 15, 83, 38, 58, 80, 17, 42, 28, 73, 86, 99, 82 | 33
Query 3 | Microsoft's Windows 10 | 78, 81 | 2
Query 4 | New Google's logo | 33, 54 | 2
Query 5 | Smartphone for selfie lovers | 5, 86, 42, 16, 52, 75 | 6
Query 6 | Smartphone 4.5G | 60 | 1
Query 7 | Online streaming music | 61 | 1
Query 8 | New smartwatch | - | 0
Query 9 | Payroll applications | 100, 74 | 2
Query 10 | Yahoo's | - | 0
Query 11 | Dislike button on Facebook | 25, 41, 63, 9, 12, 5, 14, 8, 15, |
Query 12 | New AADC sticker | 96, 63 | 2
Query 13 | Mobile device innovation | 73, 45, 43, 36, 29, 39, 31, 83, 38, 58, 80, 42, 28, 86, 72, 99, 82 | 17
Query 14 | New features in IPhone 6 | 2, 26, 29, 23, 35, 83, 63, 24, 51 | 9
Query 15 | Mobile payments | 88, 54, 34, 69 | 4
Query 16 | Mobile operating system | 73, 36, 37, 23, 71, 80, 72, 78, 26, 83, 54, |
Query 17 | Cyber hacker | 87, 73, 89 | 3
Query 18 | Cloud computing | 27, 81 | 2
Query 19 | Smartphone market | 44, 59, 30, 45, 23, 28, 73 | 7
Query 20 | Messenger apps | 43, 63, 8 | 3

System Evaluation

The system evaluation is divided into two stages: the determination of the relevant documents and the calculation of the success rate of the search system.
Determination of Relevant Documents

The relevant documents are determined by calculating the kappa statistic with three judges. Obtaining the kappa statistic takes three steps: the determination of the relevant documents, the creation of the relevance table, and the kappa statistic calculation using Equation 4, where P(A) is the proportion of the times the judges agreed and P(E) is the proportion of the times they would be expected to agree by chance. According to Manning et al. (2009), as a rule of thumb, a kappa value above 0.8 is taken as good agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement below 0.67 is seen as data providing a dubious basis for an evaluation.

$$\text{kappa} = \frac{P(A) - P(E)}{1 - P(E)} \quad (4)$$

From the calculations, we obtained an average kappa statistic of 0.71. The relevant documents according to judges 1, 2, and 3 can therefore be used for the evaluation. Table 6 shows the calculated kappa statistic values, and Table 7 shows the relevant documents according to the judges.
TABLE 6 Calculation Results of Kappa Statistic
Query | Judge 1 and Judge 2 | Judge 1 and Judge 3 | Judge 2 and Judge 3 | Average
Average

TABLE 7 Relevant Documents List
Query | Judge 1 | Judge 2 | Judge 3 | Relevant Documents | Amount
1 1, 3, 40, 58 1, 3, 32, 40, 58 3, 32, 40 1, 3, 32, 40, , 35, 78, 80, 83 35, 78, 80, 82, 83 11, 17, 35, 78, 35, 78, 80, 82, , 82, , 41 33, , , 39 5, , , , 99 61, 99 61, 99 61, , 36 34, 36 34, 36 34, , 88, , , 50, 77 48, 50, , 50, , 25 6, 25 6, 25 6, , 18, 19, 20, 22, 27, 28, 44, 53, 59, 65, 73, 92, 93, 95, 98 17, 18, 19, 20, 22, 27, 28, 65, 73, 92, 95 19, 21, 30, 92, 93, 95 17, 18, 19, 20, 22, 27, 28, 65, 73, 92, 93, , 26 24, 26 24, 26 24, , 88, , 21, 45, 46 11, 45, , , 23, 56, 62, 87, 7, 23, 56, 62, 87, 56, 62, 73, 87, 7, 23, 56, 62, 87, 7 89, 97 89, 97 89, 97 89, , , , 44, 55 21, 30, 44, 55 44, 55 30, 44, , 13, 43, 49 4, 13, 43, , 13,
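The pairwise kappa calculation of Equation 4 can be sketched in Python as follows. Several points are assumptions rather than details given in the text: P(E) is computed from marginals pooled across the two judges, which is the approach in the Manning et al. textbook that the paper cites; the average is taken over the three judge pairs, as the column layout of Table 6 suggests; documents are assumed to be numbered 1 to 100; and the function names are hypothetical. The judgments are those for Query 1 as they appear to read in Table 7.

```python
from itertools import combinations


def kappa(judge_a, judge_b, all_docs):
    """Kappa for two judges giving binary relevance labels, using pooled marginals."""
    n = len(all_docs)
    # P(A): share of documents on which both judges gave the same label
    agreed = sum((doc in judge_a) == (doc in judge_b) for doc in all_docs)
    p_agree = agreed / n
    # Pooled marginal: share of all 2n judgments that were "relevant"
    p_rel = (len(judge_a & all_docs) + len(judge_b & all_docs)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    if p_chance == 1:
        return 1.0  # both judges gave every document the same single label
    return (p_agree - p_chance) / (1 - p_chance)


def average_pairwise_kappa(judges, all_docs):
    """Mean kappa over every pair of judges."""
    pairs = list(combinations(judges, 2))
    return sum(kappa(a, b, all_docs) for a, b in pairs) / len(pairs)


all_docs = set(range(1, 101))  # assumption: document IDs run from 1 to 100
judge_1 = {1, 3, 40, 58}
judge_2 = {1, 3, 32, 40, 58}
judge_3 = {3, 32, 40}
print(average_pairwise_kappa([judge_1, judge_2, judge_3], all_docs))
```

Because most of the 100 documents are non-relevant for any single query, chance agreement is high, so a disagreement on only a few documents lowers kappa noticeably.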
System Success Rate Calculation

The success rate of the system is measured by precision, recall, and F-measure. There are three steps: comparing the system's search results with the relevant documents, as in Table 8; calculating precision and recall using Equations 5 and 6 (O'Sullivan et al., 2010); and calculating the F-measure using Equation 7 (Büttcher et al., 2010; Manning et al., 2009; Zaman and Winarko, 2011).

$$\text{precision} = \frac{|R(E) \cap R(A)|}{|R(A)|} \quad (5)$$

$$\text{recall} = \frac{|R(E) \cap R(A)|}{|R(E)|} \quad (6)$$

$$F = \frac{2PR}{P + R} \quad (7)$$

Here R(E) is the set of relevant documents, R(A) is the set of documents returned by the system, P is precision, and R is recall. The calculated precision, recall, and F-measure are shown in Table 9.

TABLE 8 Comparison of Relevant Documents to System Search Results
Query | Relevant Documents | Document Search Results | Relevant Documents ∩ Document Search Results | Amount (Relevant Documents ∩ Document Search Results)
1 1, 3, 32, 40, 58 3, 51, 40, 65 3, , 2, 19, 45, 8, 43, 5, 11, 13, 37, 88, 34, 90, 36, 72, 14, 71, 29, 7, 39, 31, 15, 83, 38, 58, 80, 17, 42, 28, 73, 86, 99, , 78, 80, 82, 78, , 41 33, , 39 5,86, 42, 16, 52, , , , , 50, , 25 25, 41, 63, 9, 12, 5, 14, 8, , , , 18, 19, 20, 22, 27, 28, 65, 73, 92, 93, 95 73, 45, 43, 36, 29, 39, 31, 83, 38, 58, 80, 42, 28, 86, 72, 99, 82 28, , 26 2, 26, 29, 23, 35, 83, 63, 24, 24, , 54, 34, , 45 73, 36, 37, 23, 71, 80, 72, , 26, 83, 54, , 23, 56, 62, 87, 73, 89 87, , 89, , , , 44, 55 44, 59, 30, 45, 23, 28, 73 30, , 13, 43 43, 63,
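As a worked illustration of Equations 5 to 7, the following Python sketch evaluates Query 1, whose relevant documents (1, 3, 32, 40, 58) and search results (3, 51, 40, 65) are both readable in Table 8. The function name is hypothetical, and returning 0 for a query with no search results or no relevant documents is an assumption made here to avoid division by zero, not something the text specifies.

```python
def precision_recall_f(relevant, retrieved):
    """Precision, recall, and F-measure for one query, following Equations 5 to 7."""
    hits = relevant & retrieved  # R(E) ∩ R(A)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Query 1 from Table 8
relevant = {1, 3, 32, 40, 58}
retrieved = {3, 51, 40, 65}
print(precision_recall_f(relevant, retrieved))
```

For Query 1 the intersection is {3, 40}, so precision is 2/4 = 0.5, recall is 2/5 = 0.4, and the F-measure is 2(0.5)(0.4)/(0.5 + 0.4), about 0.44.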
TABLE 9 Calculation Results of Precision, Recall, and F-measure
Query | Precision | Recall | F-measure
Average

Table 9 shows that the average precision is smaller than the average recall. This is because the system returns many documents, of which only a few are relevant. In addition, the average F-measure indicates that the success rate of the search system is low.

CONCLUSION

From the research that has been done, we obtained the following conclusions. The system displays the resulting documents in sequence based on cosine similarity, from the largest value to the smallest. The documents produced by the system are those with a cosine similarity value greater than 0. Based on the evaluation, the degree of relevance of the documents returned by the search process using the common phrase index is low. This can be seen from the precision, recall, and F-measure, which are 0.37, 0.50, and [...] respectively.

REFERENCES

1. Bahle, D., Williams, H. E., and Zobel, J. (2002). Efficient Phrase Querying with an Auxiliary Index. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
2. Büttcher, S., Clarke, C. L., and Cormack, G. V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. Massachusetts: The MIT Press.
3. Chang, M., and Poon, C. K. (2007). Efficient Phrase Querying with Common Phrase Index. Information Processing and Management 44 (2008).
4. Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge, England: Cambridge University Press.
5. Mao, W., and Chu, W. W. (2006). The Phrase-Based Vector Space Model for Automatic Retrieval of Free-Text Medical Documents. Data & Knowledge Engineering 61 (2007).
6. O'Sullivan, D. M., Wilk, S. A., Michalowski, W. J., and Farion, K. J. (2010). Automatic Indexing and Retrieval of Encounter-Specific Evidence for Point-of-Care Support. Journal of Biomedical Informatics 43 (2010).
7. Patterson, K., Watters, C., and Shepherd, M. (2008). Document Retrieval using Proximity-Based Phrase Searching. Proceedings of the 41st Hawaii International Conference on System Sciences. IEEE.
8. Oxford Dictionaries online. Retrieved on 19 November.
9. Zaman, B., and Winarko, E. (2011). Analisis Fitur Kalimat untuk Peringkas Teks Otomatis pada Bahasa Indonesia. IJCCS, Vol. 5 No. 2, July 2011.
Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationText Documents clustering using K Means Algorithm
Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals
More informationInferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets
2016 IEEE 16th International Conference on Data Mining Workshops Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets Teruaki Hayashi Department of Systems Innovation