Implementation of the common phrase index method on the phrase query for information retrieval

Citation: Triyah Fatmawati, Badrus Zaman, and Indah Werdiningsih, AIP Conference Proceedings 1867, 020027 (2017); https://doi.org/10.1063/1.4994430. Published by the American Institute of Physics.

Implementation of The Common Phrase Index Method on The Phrase Query for Information Retrieval

Triyah Fatmawati a), Badrus Zaman b), Indah Werdiningsih c)

Information System Study Program, Faculty of Science and Technology, Airlangga University, Surabaya, Indonesia

a) triyah.fatmawati-12@fst.unair.ac.id
b) badruszaman@fst.unair.ac.id
c) indahwerdiningsih@fst.unair.ac.id

Abstract. With the development of technology, finding information in news text has become easy, because news text is distributed not only in print media, such as newspapers, but also in electronic media that can be accessed using a search engine. In the process of finding relevant documents with a search engine, a phrase is often used as the query. The number of words that make up the phrase query and their positions clearly affect the relevance of the documents produced, and the accuracy of the information obtained is affected as a result. Based on this problem, the purpose of this research was to analyze the implementation of the common phrase index method in information retrieval. The research was conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced. The system is built in the stages of pre-processing, indexing, term weighting calculation, and cosine similarity calculation; it then displays the document search results in sequence, ordered by cosine similarity. The system was tested using 100 documents and 20 queries, and the results were used for the evaluation stage. First, the relevant documents were determined using the kappa statistic; second, the system success rate was measured using precision, recall, and F-measure. In this research, the kappa statistic was 0.71, so the relevant document judgments are eligible for system evaluation. The evaluation produced a precision of 0.37, a recall of 0.50, and an F-measure of 0.43. From these results it can be said that the success rate of the system in producing relevant documents is low.

INTRODUCTION

According to Oxford Dictionaries, a phrase is a small group of words standing together as a conceptual unit, typically forming a component of a clause. The arrangement of word positions in a phrase can produce different meanings (Mao et al., 2006), which has an impact on the accuracy of the information obtained. Phrases are generally used in sentences, as in news text. News is a source of the information needed to find out about events that are happening. With the development of technology, finding information in news text has become easy, because news text is distributed not only in print media, such as newspapers, but also in electronic media that can be accessed using a search engine.

In the process of finding relevant documents with a search engine, a phrase is often used as the query. A phrase query entered by a user can be composed of two or more words. According to Patterson et al. (2008), a rise in the complexity of the query increases the possibility of error; as a result, the accuracy of the information obtained is affected. In addition, the word positions in the query must be considered, because different word positions can produce different meanings, so the information obtained from the search process will also differ. Therefore, a method is needed to handle phrase queries, namely indexing. Indexing is the main process in an information retrieval system (Mao et al., 2006).

Indexing can be done by several methods, including the inverted index, the auxiliary nextword index, and the common phrase index (Manning et al., 2009; Bahle et al., 2002; Chang and Poon, 2007). The inverted index, or inverted file, is a basic concept in information retrieval (Manning et al., 2009). In its implementation, each sentence in a document is broken down into terms. The advantage of this method is that it provides fast searching through an enormous number of documents (Chang and Poon, 2007). For phrase querying, however, it is not simple to predict where the occurrences of a term will be within a query phrase, so the inverted index is unlikely to be effective for phrases (Bahle et al., 2002). The auxiliary nextword index is an indexing method that reduces the cost of disk access and can thereby increase efficiency, but it only works well for phrase queries of size two (Chang and Poon, 2007). The common phrase index is an alternative for phrase query optimization (Chang and Poon, 2007). Its advantage is a tree-like vocabulary structure, so it can be used for queries of size greater than two. Considering the advantages of each method, the method used in this research is the common phrase index.

According to Chang and Poon (2007), each word in the common phrase index method is grouped as either a common word or a rare word. Common words are words that occur with high frequency; conversely, rare words are words that occur with low frequency. Unlike the auxiliary nextword index, the common phrase index has a tree-like structure, composed from roots to leaves. Each path in the common phrase index represents a phrase, which starts from a common word and is terminated by a terminal word.

Based on the outlined problem, the purpose of this research is to analyze the implementation of the common phrase index method in information retrieval. The research is conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced.

RESEARCH METHODS

The research on the implementation of the common phrase index method on the phrase query for information retrieval is carried out in four main steps: pre-processing, indexing, term weighting, and cosine similarity. The steps of this research are shown in Figure 1.

FIGURE 1. Research Method

Pre-Processing

The pre-processing stage is performed every time a document or query is inserted into the system. In this stage, two steps are taken, namely tokenizing and stemming. Tokenizing is the process of breaking a sentence into words. Stemming is the process of mapping the various forms of a word, whether carrying a prefix, a suffix, or a combination of prefix and suffix (confix), to its base word (stem) (Zaman and Winarko, 2011). For English text, the Porter Stemmer algorithm is used.

Indexing

Before the indexing process, the common words are determined. In this process, the frequency of occurrence of each term across all documents is calculated, and the frequencies are summed to a total. The total frequency is then used to set the threshold for common words, by computing the occurrence probability of each term and its cumulative share. In this research, the common words are limited to a cumulative total of 75%.

At the indexing stage, every word formed in the pre-processing stage is matched against the common word list. If a word is included in the common word list, the word is merged with the adjacent words. The merging process stops when a terminal word is met, that is, a verb, noun, or adjective. If a word is not included in the common word list, it is not merged.

Term Weighting Calculation

Term weighting is calculated for each term in each document and query using Equation 1 (Manning et al., 2009):

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t \quad (1)$$

tf is obtained by counting the frequency of occurrence of a term (an indexing result) in the documents and queries. idf is calculated using Equation 2, where df is the number of documents containing the term and N is the total number of documents:

$$\text{idf}_t = \log \frac{N}{\text{df}_t} \quad (2)$$

Cosine Similarity Calculation

Cosine similarity is calculated using Equation 3 (Manning et al., 2009), where $\vec{V}(q)$ and $\vec{V}(d)$ are the weight vectors of the query and the document. Cosine similarity is used to measure the similarity between the query and the stored corpus:

$$\text{score}(q,d) = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|} \quad (3)$$

RESULTS AND DISCUSSION

Data Collection

The data used are English news articles about information technology, with a simple structure, taken from the online news site of The Jakarta Post. A total of 100 news articles were collected, downloaded in December 2015. These documents are then used to test the system in order to determine the relevance level of the search results. Figure 2 shows an example document.

FIGURE 2. Examples of Documents

Pre-Processing

In the pre-processing phase, the steps taken are tokenizing and stemming. Tokenizing is the process of splitting each document and query into words. In the tokenizing process, the system converts uppercase to lowercase, eliminates punctuation characters, and uses the spaces in the document and query as splitting points. In the stemming process, the system removes prefixes and suffixes using the Porter Stemmer algorithm.

Indexing

The indexing process is carried out using the common phrase index method. The indexing algorithm is shown in Figure 3 and sketched in code below. In practice, this method requires a common word list and a terminal word list. The common word list can change every time a new document is inserted into the system. The terminal word list does not change and only holds the verb, noun, and adjective lists.

FIGURE 3. Indexing Algorithm Using Common Phrase Index
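To make the pre-processing and indexing steps concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the authors' code. NLTK's PorterStemmer stands in for the paper's Porter implementation, the common-word selection applies the 75% cumulative-frequency rule from the Research Methods section, and the per-position merging (each common word opens a phrase that ends at the next terminal word) is inferred from the example term arrays in Tables 1 and 3.

```python
from collections import Counter

from nltk.stem import PorterStemmer  # stand-in for the paper's Porter stemmer

stemmer = PorterStemmer()

def preprocess(text):
    # Tokenizing: lowercase, strip punctuation, split on whitespace; then stem.
    tokens = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return [stemmer.stem(t) for t in tokens]

def select_common_words(tokenized_docs, cumulative_limit=0.75):
    # A word counts as "common" while the cumulative share of all occurrences
    # covered by the most frequent words is still within the limit (75%).
    counts = Counter(w for doc in tokenized_docs for w in doc)
    total = sum(counts.values())
    common, covered = set(), 0.0
    for word, freq in counts.most_common():
        common.add(word)
        covered += freq / total
        if covered >= cumulative_limit:
            break
    return common

def index_common_phrases(words, common_words, terminal_words):
    # Each common word opens a phrase that is extended with the following
    # words until a terminal word (verb, noun, or adjective) is met; a word
    # that is not common becomes a single-word term array.
    term_arrays = []
    for i, word in enumerate(words):
        if word in common_words:
            phrase = [word]
            for nxt in words[i + 1:]:
                phrase.append(nxt)
                if nxt in terminal_words:
                    break
            term_arrays.append(phrase)
        else:
            term_arrays.append([word])
    return term_arrays
```

With hypothetical word lists in which "googl" and "map" are common and "map" and "applic" are terminal, index_common_phrases(preprocess("Google maps application"), common, terminal) would yield [['googl', 'map'], ['map', 'applic'], ['applic']], matching the term arrays of Query 1 in Table 3.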

In the indexing process, every word formed in the pre-processing stage is matched against the common word list. If a word is included in the list of common words, it is merged with the adjacent words; the merging process stops on meeting a terminal word, so a term array composed of two or more words is formed. If a word is not included in the list of common words, it is not merged, so it forms a single-word term array.

Term Weighting Calculation

In general, the term weighting process is divided into four stages, namely the calculation of term frequency, document frequency, inverse document frequency, and term weighting. The first stage is to calculate the term frequency (tf) of each term array resulting from indexing, for both documents and queries. Each time a document or query is entered, the system calculates the tf of all its term arrays; when the system receives another document or query, the tf calculation starts over. The second stage is to calculate the document frequency (df) of each term array: whenever a new document enters the system, the system increments df for each term array in the document that matches a term array found in previous documents. The third stage is to calculate the inverse document frequency (idf) using Equation 2. Example calculations of df and idf for a sample document are shown in Table 1. The fourth stage is to calculate the term weight (tf-idf) of each term array in each document and query using Equation 1. Example tf-idf calculation results are shown in Table 2 for a document and in Table 3 for queries.

TABLE 1. Calculation Results of Df and Idf

  Term Array          Df   Idf
  [microsoft, sai]     1    2
  [now, run]           1    2
  [store]              1    2
  [reaction]           1    2

TABLE 2. Example of Term Weighting Calculation for a Document

  Term Phrase ID   Doc ID   Term Array          Tf   Idf   Tf-Idf
  TermPhrase1      Dok5     [microsoft, sai]     4    2      8
  TermPhrase10     Dok5     [now, run]           2    2      4
  TermPhrase100    Dok5     [store]              1    2      2
  TermPhrase101    Dok5     [reaction]           1    2      2

TABLE 3. Example of Term Weighting Calculation for Queries

  Term Phrase ID   Query ID   Term Array        Tf   Idf   Tf-Idf
  TermPhrase148    Query1     [googl, map]       1    2      2
  TermPhrase149    Query1     [map, applic]      1    2      2
  TermPhrase150    Query1     [applic]           1    2      2
  TermPhrase663    Query10    [yahoo, email]     1    0      0

Cosine Similarity Calculation

At this stage, the cosine similarity is calculated for each query against each document using Equation 3. Table 4 shows an example of the cosine similarity calculation for Query 1.

TABLE 4. Calculation Result of Cosine Similarity for Query 1

  No.   Cosine Similarity ID   Doc ID   Query ID   Value of Cosine Similarity
  1.    Cosine1                Dok1     Query1     0.27
  2.    Cosine2                Dok2     Query1     0
  3.    Cosine3                Dok3     Query1     0
  4.    Cosine4                Dok4     Query1     0
  5.    Cosine5                Dok5     Query1     0
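The weighting and scoring steps can be written down compactly. The following is a minimal sketch, not the authors' implementation. It assumes a base-10 logarithm in Equation 2, which is consistent with Tables 1 and 2 (df = 1 over N = 100 documents gives idf = log10(100/1) = 2), and it drops query term arrays that appear in no document, which has the same effect as the zero weight of [yahoo, email] in Table 3.

```python
import math
from collections import Counter

def tfidf_vector(term_arrays, df, n_docs):
    # Equations 1 and 2: weight = tf * log10(N / df). Term arrays are made
    # hashable as tuples; a term array unseen in any document has no df and
    # is dropped, equivalent to giving it weight 0.
    tf = Counter(tuple(t) for t in term_arrays)
    return {t: freq * math.log10(n_docs / df[t])
            for t, freq in tf.items() if t in df}

def cosine_score(q_vec, d_vec):
    # Equation 3: dot product of the weight vectors over the product of norms.
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
```

Documents are then ranked by cosine_score, and only documents with a score greater than 0 are displayed, as in Table 4.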

System Testing

In the testing stage, 100 documents and 20 queries were used. The testing was done in two scenarios. The first scenario was entering the 100 .txt documents through the document input feature. The second scenario was entering the 20 queries through the search feature. The system then displays the documents of the search results serially in a table; however, the system only displays documents with a cosine similarity value greater than 0. All queries used in this stage and the documents returned by the system are shown in Table 5.

TABLE 5. Queries for Testing and Document Search Results of the System

  Query ID   Query                               Document IDs of Search Results                    Amount
  Query 1    Google maps application             3, 51, 40, 65                                     4
  Query 2    Marshmallow android mobile device   21, 2, 19, 45, 8, 43, 5, 11, 13, 37, 88, 34, 90,  33
                                                 36, 72, 14, 71, 29, 7, 39, 31, 15, 83, 38, 58,
                                                 80, 17, 42, 28, 73, 86, 99, 82
  Query 3    Microsoft's Windows 10              78, 81                                            2
  Query 4    New Google's logo                   33, 54                                            2
  Query 5    Smartphone for selfie lovers        5, 86, 42, 16, 52, 75                             6
  Query 6    Smartphone 4.5G                     60                                                1
  Query 7    Online streaming music              61                                                1
  Query 8    New smartwatch                      -                                                 0
  Query 9    Payroll applications                100, 74                                           2
  Query 10   Yahoo's email                       -                                                 0
  Query 11   Dislike button on Facebook          25, 41, 63, 9, 12, 5, 14, 8, 15, 11               10
  Query 12   New AADC sticker                    96, 63                                            2
  Query 13   Mobile device innovation            73, 45, 43, 36, 29, 39, 31, 83, 38, 58, 80, 42,   17
                                                 28, 86, 72, 99, 82
  Query 14   New features in IPhone 6            2, 26, 29, 23, 35, 83, 63, 24, 51                 9
  Query 15   Mobile payments                     88, 54, 34, 69                                    4
  Query 16   Mobile operating system             73, 36, 37, 23, 71, 80, 72, 78, 26, 83, 54, 82    12
  Query 17   Cyber hacker                        87, 73, 89                                        3
  Query 18   Cloud computing                     27, 81                                            2
  Query 19   Smartphone market                   44, 59, 30, 45, 23, 28, 73                        7
  Query 20   Messenger apps                      43, 63, 8                                         3

System Evaluation

The system evaluation stage is divided into two parts, namely the determination of the relevant documents and the calculation of the system's search success rate.

Determination of Relevant Documents

The relevant documents are determined by calculating the kappa statistic over the judgments of three judges. To obtain the kappa statistic, three steps are needed: the determination of the relevant documents by each judge, the creation of the relevance table, and the kappa statistic calculation using Equation 4, where P(A) is the proportion of agreement observed and P(E) is the proportion of agreement expected by chance. According to Manning et al. (2009), as a rule of thumb, a kappa value above 0.8 is taken as good agreement, a value between 0.67 and 0.8 as fair agreement, and agreement below 0.67 as data providing a dubious basis for an evaluation.

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)} \quad (4)$$

From the calculations, an average kappa statistic of 0.71 was obtained, so the relevant documents according to judges 1, 2, and 3 can be used for the evaluation. Table 6 shows the calculated kappa statistic values, and Table 7 shows the relevant documents according to the judges.

TABLE 6. Calculation Results of the Kappa Statistic

  Query   Judge 1 and Judge 2   Judge 1 and Judge 3   Judge 2 and Judge 3   Average
  1       0.883653287           0.55588453            0.739583333           0.726373717
  2       1                     1                     1                     1
  3       0.789473684           0.645390071           0.822695035           0.752519597
  4       1                     0.661590525           0.661590525           0.774393683
  5       1                     0.661590525           0.661590525           0.774393683
  6       0.661590525           1                     0.661590525           0.774393683
  7       1                     1                     1                     1
  8       1                     1                     1                     1
  9       0.489795918           0.384615385           0.661590525           0.512000609
  10      1                     -0.020408163          -0.020408163          0.319727891
  11      1                     1                     1                     1
  12      1                     1                     1                     1
  13      0.785913081           0.284984678           0.292831887           0.454576549
  14      1                     1                     1                     1
  15      0.489795918           -0.005025126          -0.015228426          0.156514122
  16      0.55588453            0.384615385           0.489795918           0.476765278
  17      1                     0.753187988           0.753187988           0.835458659
  18      0.661590525           -0.01010101           0.661590525           0.437693346
  19      0.85196151            0.794871795           0.656357388           0.767730231
  20      0.739583333           0.384615385           0.384615385           0.502938034
  Overall average                                                           0.713273954

TABLE 7. Relevant Documents List

  Query   Judge 1                      Judge 2                      Judge 3                  Relevant Documents           Amount
  1       1, 3, 40, 58                 1, 3, 32, 40, 58             3, 32, 40                1, 3, 32, 40, 58             5
  2       2                            2                            2                        2                            1
  3       32, 35, 78, 80, 83           35, 78, 80, 82, 83           11, 17, 35, 78,          35, 78, 80, 82, 83           5
                                                                    80, 82, 83
  4       33, 41                       33, 41                       33                       33, 41                       2
  5       5, 39                        5, 39                        5                        5, 39                        2
  6       60                           60, 98                       60                       60                           1
  7       61, 99                       61, 99                       61, 99                   61, 99                       2
  8       34, 36                       34, 36                       34, 36                   34, 36                       2
  9       74, 88, 100                  74                           54, 74                   74                           1
  10      48, 50, 77                   48, 50, 77                   13                       48, 50, 77                   3
  11      6, 25                        6, 25                        6, 25                    6, 25                        2
  12      96                           96                           96                       96                           1
  13      17, 18, 19, 20, 22, 27,      17, 18, 19, 20, 22, 27,      19, 21, 30, 92,          17, 18, 19, 20, 22, 27,      12
          28, 44, 53, 59, 65, 73,      28, 65, 73, 92, 95           93, 95                   28, 65, 73, 92, 93, 95
          92, 93, 95, 98
  14      24, 26                       24, 26                       24, 26                   24, 26                       2
  15      54                           54, 88, 100                  -                        54                           1
  16      11, 21, 45, 46               11, 45, 53                   45                       11, 45                       2
  17      7, 23, 56, 62, 87, 89, 97    7, 23, 56, 62, 87, 89, 97    56, 62, 73, 87, 89, 97   7, 23, 56, 62, 87, 89, 97    7
  18      15                           15, 100                      100                      15, 100                      2
  19      30, 44, 55                   21, 30, 44, 55               44, 55                   30, 44, 55                   3
  20      4, 13, 43, 49                4, 13, 43, 46                43                       4, 13, 43                    3
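As an illustration of Equation 4 and the pairwise values in Table 6, here is a minimal Python sketch for one pair of judges over a pooled set of candidate documents. It is an assumption-laden sketch rather than the authors' procedure: it computes P(E) from the pooled marginal probabilities of the "relevant" and "non-relevant" labels, following the two-judge convention in Manning et al. (2009).

```python
def kappa_statistic(judge_a, judge_b):
    # judge_a, judge_b: parallel lists of 0/1 relevance labels for the same
    # pooled candidate documents. Equation 4: (P(A) - P(E)) / (1 - P(E)).
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n  # P(A)
    # P(E): chance agreement from the pooled marginal label probabilities.
    p_relevant = (sum(judge_a) + sum(judge_b)) / (2 * n)
    p_chance = p_relevant ** 2 + (1 - p_relevant) ** 2
    return 1.0 if p_chance == 1 else (p_agree - p_chance) / (1 - p_chance)

# Perfect agreement yields kappa = 1, as for queries 2, 7, and 8 in Table 6.
print(kappa_statistic([1, 0, 1, 0], [1, 0, 1, 0]))  # 1.0
```

Averaging the three pairwise kappa values per query, and then averaging over the 20 queries, gives the overall value of 0.71 reported in Table 6.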

System Success Rate Calculation

The success rate of the system is seen from the calculation of precision, recall, and F-measure. Three steps are carried out: comparing the documents of the system search results with the relevant documents, as in Table 8; calculating the precision and recall values using Equations 5 and 6 (O'Sullivan et al., 2010); and calculating the F-measure value using Equation 7 (Büttcher et al., 2010; Manning et al., 2009; Zaman and Winarko, 2011).

$$\text{precision} = \frac{|R(E) \cap R(A)|}{|R(A)|} \quad (5)$$

$$\text{recall} = \frac{|R(E) \cap R(A)|}{|R(E)|} \quad (6)$$

$$F = \frac{2PR}{P + R} \quad (7)$$

where R(E) is the set of relevant documents, R(A) is the set of documents retrieved by the system, P is precision, and R is recall. The results of the precision, recall, and F-measure calculations are shown in Table 9.

TABLE 8. Comparison of Relevant Documents to System Search Results

  Query   Relevant Documents             Document Search Results                  Relevant ∩ Retrieved   Amount
  1       1, 3, 32, 40, 58               3, 51, 40, 65                            3, 40                  2
  2       2                              21, 2, 19, 45, 8, 43, 5, 11, 13, 37,     2                      1
                                         88, 34, 90, 36, 72, 14, 71, 29, 7, 39,
                                         31, 15, 83, 38, 58, 80, 17, 42, 28,
                                         73, 86, 99, 82
  3       35, 78, 80, 82, 83             78, 81                                   78                     1
  4       33, 41                         33, 54                                   33                     1
  5       5, 39                          5, 86, 42, 16, 52, 75                    5                      1
  6       60                             60                                       60                     1
  7       61, 99                         61                                       61                     1
  8       34, 36                         -                                        -                      0
  9       74                             100, 74                                  74                     1
  10      48, 50, 77                     -                                        -                      0
  11      6, 25                          25, 41, 63, 9, 12, 5, 14, 8, 15, 11      25                     1
  12      96                             96, 63                                   96                     1
  13      17, 18, 19, 20, 22, 27, 28,    73, 45, 43, 36, 29, 39, 31, 83, 38,      28, 73                 2
          65, 73, 92, 93, 95             58, 80, 42, 28, 86, 72, 99, 82
  14      24, 26                         2, 26, 29, 23, 35, 83, 63, 24, 51        24, 26                 2
  15      54                             88, 54, 34, 69                           54                     1
  16      11, 45                         73, 36, 37, 23, 71, 80, 72, 78, 26,      -                      0
                                         83, 54, 82
  17      7, 23, 56, 62, 87, 89, 97      87, 73, 89                               87, 89                 2
  18      15, 100                        27, 81                                   -                      0
  19      30, 44, 55                     44, 59, 30, 45, 23, 28, 73               30, 44                 2
  20      4, 13, 43                      43, 63, 8                                43                     1
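Below is a minimal Python sketch of Equations 5 through 7, using Query 1 from Table 8 as a worked check; the numbers it prints match the first row of Table 9. The convention of returning 0.0 when a denominator is empty stands in for the "-" entries of Table 9 and is an assumption of this sketch.

```python
def evaluate(relevant, retrieved):
    # Equations 5-7: R(E) = relevant set, R(A) = set retrieved by the system.
    hits = len(relevant & retrieved)                    # |R(E) ∩ R(A)|
    p = hits / len(retrieved) if retrieved else 0.0     # precision, Eq. 5
    r = hits / len(relevant) if relevant else 0.0       # recall, Eq. 6
    f = 2 * p * r / (p + r) if p + r else 0.0           # F-measure, Eq. 7
    return p, r, f

# Query 1: R(E) = {1, 3, 32, 40, 58}, R(A) = {3, 51, 40, 65}; two hits (3, 40)
print(evaluate({1, 3, 32, 40, 58}, {3, 51, 40, 65}))  # (0.5, 0.4, 0.444...)
```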

TABLE 9. Calculation Results of Precision, Recall, and F-measure

  Query     Precision   Recall     F-measure
  1         0.5         0.4        0.444444444
  2         0.030303    1          0.058823529
  3         0.5         0.2        0.285714286
  4         0.5         0.5        0.5
  5         0.1666667   0.5        0.25
  6         1           1          1
  7         1           0.5        0.666666667
  8         -           0          -
  9         0.5         1          0.666666667
  10        -           0          -
  11        0.1         0.5        0.166666667
  12        0.5         1          0.666666667
  13        0.1176471   0.166667   0.137931034
  14        0.2222222   1          0.363636364
  15        0.25        1          0.4
  16        0           0          -
  17        0.6666667   0.285714   0.4
  18        0           0          -
  19        0.2857143   0.666667   0.4
  20        0.3333333   0.333333   0.333333333
  Average   0.3706974   0.502619   0.4266943

From Table 9 it can be seen that the average precision is smaller than the average recall. This is because the system retrieves more documents than the small number of relevant documents it actually returns, so only a small fraction of the retrieved documents are relevant. In addition, from the average F-measure it can be said that the search success rate of the system is low.

CONCLUSION

From the research that has been done, the following conclusions are obtained. The system displays the resulting documents in sequence based on cosine similarity, from the largest to the smallest value; the documents returned by the system are those with a cosine similarity value greater than 0. Based on the evaluation, the degree of relevance of the documents resulting from the search process using the common phrase index is low. This can be seen from the calculated precision, recall, and F-measure, which are 0.37, 0.50, and 0.43, respectively.

REFERENCES

1. Bahle, D., Williams, H. E., and Zobel, J. (2002). Efficient Phrase Querying with an Auxiliary Index. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
2. Büttcher, S., Clarke, C. L., and Cormack, G. V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. Massachusetts: The MIT Press.
3. Chang, M., and Poon, C. K. (2007). Efficient Phrase Querying with Common Phrase Index. Information Processing and Management 44 (2008), 756-769.
4. Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge, England: Cambridge University Press.
5. Mao, W., and Chu, W. W. (2006). The Phrase-Based Vector Space Model for Automatic Retrieval of Free-Text Medical Documents. Data & Knowledge Engineering 61 (2007), 76-92.
6. O'Sullivan, D. M., Wilk, S. A., Michalowski, W. J., and Farion, K. J. (2010). Automatic Indexing and Retrieval of Encounter-Specific Evidence for Point-of-Care Support. Journal of Biomedical Informatics 43 (2010), 623-631.
7. Patterson, K., Watters, C., and Shepherd, M. (2008). Document Retrieval using Proximity-Based Phrase Searching. In Proceedings of the 41st Hawaii International Conference on System Sciences. IEEE.
8. Oxford Dictionaries online. http://www.oxforddictionaries.com, retrieved on 19 November 2015.
9. Zaman, B., and Winarko, E. (2011). Analisis Fitur Kalimat untuk Peringkas Teks Otomatis pada Bahasa Indonesia [Sentence Feature Analysis for Automatic Text Summarization in Indonesian]. IJCCS, Vol. 5, No. 2, July 2011, 60-68.