New Concept based Indexing Technique for Search Engine


Indian Journal of Science and Technology, Vol 10(18), DOI: /ijst/2017/v10i18/114018, May 2017
ISSN (Print) : ISSN (Online) :

Sangita Karmakar and Soumen Swarnakar*
Department of Information Technology, Netaji Subhash Engineering College, Kolkata, West Bengal, India; sangitakarmakar1995@gmail.com, soumen_swarnakar@yahoo.co.in

Abstract

Objectives: To find a better indexing method for search engines in order to improve information retrieval.

Methods/Statistical Analysis: The indexing technique is a very important part of any search engine or information retrieval system. Many indexing techniques have been proposed earlier, but they do not retrieve information from the database accurately. In this paper we propose a new indexing technique that aims to maximize the chance that the information retrieval system returns proper information for a search query. A search engine index has many different parts, such as design factors and data structures, and when an index is built there are many types of data structures to choose from. Some well known data structures are the suffix tree, tree, inverted index, citation index, Ngram index and term document matrix, each used for a different type of index design. In this paper a combination of existing indexing methods is used to form a new indexing technique.

Findings: The experimental results described in the paper show that the accuracy of search results using the S-N indexing method is 5% better than that of existing search engine indexing techniques.

Application/Improvements: In this paper we design an indexing technique that combines the Ngram index and the suffix tree index for a better information retrieval process in search engines. The proposed indexing technique can therefore return more accurate and more relevant search results.
Keywords: Concept based, Concept Based Index, Improved Search Index, New Index, Search Engine, Search Engine Index

*Author for correspondence

1. Introduction

Every search engine has its own indexing technique for producing better search results in minimum search time. The purpose of storing an index in any information retrieval system is to optimize the speed and performance of finding relevant documents for a search query. In any search engine, indexing collects, parses and stores data to facilitate fast and accurate information retrieval. The search engine index is the place where all the data the search engine has collected is stored. It is the index that provides the results for search queries, and it is the pages stored within the index that appear on the search engine results page. Without an index, the search engine would take a considerable amount of time and effort each time a user initiated a search query. A search engine index has many different parts, such as design factors and data structures. The design of the index decides the internal architecture of the search engine and how the index will actually work.

Many researchers have developed search engine indexing techniques for better search results in the information retrieval process. Website security leaks in search engines are discussed in [1], where the authors discuss the importance of website information security for different search engines. Document clustering is a key concept of any information retrieval system. In [2] a new approach to concept based document clustering is discussed, and the authors compare it with hierarchical clustering. A practical approach to web search engines is proposed in [3], which discusses the working principle of a web search engine. A comparative study of traditional search engines with meta search engines is presented in [4], which discusses indexing for better search results according to the user's search query. The paper [5] deals with the term document matrix indexing technique for retrieving information from the search engine. The suffix tree clustering data mining algorithm [6] is also helpful for research on search engine indexing. Content based ranking for search engines [7] ranks pages when indexing takes place before query processing. In [8] an improved indexing mechanism for indexing web documents for search engines is discussed. A critical review of searching algorithms for search engines, based on hierarchic document clustering, is given in [9]. An evolutionary algorithm for search engines [10] has been suggested for a better searching process. An enhanced model of web page prediction using page rank and a Markov model [11] has also been suggested, which is helpful for search engine based work.

The aim of this paper is to introduce a new search indexing technique for web documents. Section 2 describes the methodology, covering the existing indexing methods used to build the new technique, while Section 3 describes the proposed model of a search engine using the new indexing technique. Experimental results are described in Section 4, and the conclusion is given in Section 5.

2. Methodology

This part describes the terminology used in the proposed work.

2.1 Suffix Tree Index

Among search engine data structures, one of the first indexes is the suffix tree index. It is also known as a PAT tree or position tree. It is a compressed trie containing all the suffixes of the given text as keys and their positions in the text as values. The suffix tree is mainly used for fast string operations in indexing.
A suffix tree is basically the compressed trie of the nonempty suffixes of a string. Because it is a compressed trie, the rooted tree is referred to as a trie and its subtrees as subtries.

Steps to build a suffix tree:
1. Take the string and append the $ symbol to its end, which marks the terminal point of the string. Let the string be T.
2. Build the trie containing all the suffixes of the string as keys.
3. From the trie, build the compressed trie. It contains all suffixes of the given text as keys, together with their positions in the text.
4. From the compressed trie, finally build the suffix tree.

In the suffix tree every leaf node holds an offset number, which is used in indexing to retrieve information from the relational database system. (The offset number is the position of a character in the string, counted from the start to the last position of the string.)

Let us take an example to better understand the working principle of suffix tree indexing. The string abaaba is used to give an overview of the suffix tree index data structure. The following steps describe the creation of the suffix tree.

Step 1: T is the string abaaba$, and its offsets are represented as shown in Figure 1.

Figure 1. The offsets of the string T.
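The construction steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (build_suffix_trie, compress, leaf_offsets, edge_pairs) are hypothetical, offsets are counted from 0, the trie is built naively in quadratic time rather than with a linear-time algorithm such as Ukkonen's, and edges are reported as (offset, length) pairs into T, which is one common compact representation of suffix tree edges.

```python
def build_suffix_trie(text):
    """Steps 1-2: append '$' and insert every suffix into an uncompressed trie."""
    t = text + "$"
    root = {}
    for offset in range(len(t)):
        node = root
        for ch in t[offset:]:
            node = node.setdefault(ch, {})
        node["offset"] = offset  # leaf: remember where this suffix starts
    return t, root


def compress(node):
    """Step 3: merge every chain of single-child branch nodes into one edge."""
    out = {}
    for key, child in node.items():
        if key == "offset":
            out["offset"] = child
            continue
        label = key
        # Follow the chain while the child is a branch node with one child.
        while len(child) == 1 and "offset" not in child:
            (next_key, child), = child.items()
            label += next_key
        out[label] = compress(child)
    return out


def leaf_offsets(node):
    """Collect the suffix offsets stored in the leaves, in ascending order."""
    found = []
    for key, child in node.items():
        if key == "offset":
            found.append(child)
        else:
            found.extend(leaf_offsets(child))
    return sorted(found)


def edge_pairs(node, t, prefix=""):
    """Step 4: describe each edge as (offset, length), using the first
    occurrence in t of the path that ends with that edge."""
    pairs = []
    for label, child in node.items():
        if label == "offset":
            continue
        path = prefix + label
        start = t.find(path) + len(prefix)
        pairs.append((start, len(label)))
        pairs.extend(edge_pairs(child, t, path))
    return pairs


T, trie = build_suffix_trie("abaaba")
tree = compress(trie)
print(sorted(tree))          # edge labels leaving the root
print(leaf_offsets(tree))    # one leaf per suffix of T
print(edge_pairs(tree, T))   # (offset, length) pair for every edge
```

Nested dictionaries keep the sketch short; a production index would normally store only the (offset, length) pairs and the text itself, so that edge labels are not copied.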

Step 2: After the suffixes of the given string have been constructed, the trie is built. The trie is shown in Figure 2.

Figure 2. The trie built from the given suffixes.

Step 3: The trie built from the suffixes is too elongated, so all branch nodes that have only one child are eliminated. Eliminating these branch nodes produces the compressed trie, which improves both the time and the space performance of a trie. The compressed trie is shown in Figure 3.

Figure 3. The compressed trie.

Step 4: After the compressed trie has been constructed, the suffix tree is built. In the suffix tree, each suffix is represented by an offset and a length in the string. In the figure, (0, 1) means that a, the first suffix in the tree for the string abaaba$, starts at offset 0 and has length 1. The length is determined by the length of the suffix portion in the given string. A position (i, j) therefore refers to an offset i and a length j in the given string. The rectangular boxes represent the final suffixes in the tree. The final suffix tree is shown in Figure 4.

Figure 4. The final suffix tree.

2.2 Ngram Index

An Ngram is a contiguous sequence of n items from a given sequence of text or speech, and an Ngram index is built from such sequences of the query text. The Ngram index draws on computational linguistics and probability to build a better search engine index for quick query processing. The items can be phonemes, syllables, letters, words or base pairs, according to the application. Ngrams are typically collected from a text or speech corpus; when the items are words, the Ngrams are also known as shingles. Ngrams have specific names according to their size: an Ngram of size 1 is called a unigram, size 2 a bigram or digram, and size 3 a trigram.
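As a small illustration of the sizes named above, the following Python sketch extracts Ngrams from a sequence. The function name ngrams and the sample inputs are assumptions made for this example; the same function works on letters (a string) and on words (a list), the latter giving shingles.

```python
def ngrams(items, n):
    """Return every contiguous run of n items from a string or a list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i:i + n] for i in range(len(items) - n + 1)]


word = "ELEPHANT"  # sample input chosen for illustration
unigrams = ngrams(word, 1)
digrams = ngrams(word, 2)
trigrams = ngrams(word, 3)

# Word-level Ngrams (shingles) over a hypothetical query.
query_words = "new concept based indexing".split()
shingles = [" ".join(group) for group in ngrams(query_words, 2)]

print(unigrams)
print(digrams)
print(trigrams)
print(shingles)
```

A sequence of length m yields m - n + 1 Ngrams of size n, so a sequence shorter than n yields none.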
Larger sizes are usually referred to by the value of n. Let us take an example to better understand the working principle of Ngram indexing. Suppose the word ELEPHANT is taken for creating trigrams; the trigrams are ELE, LEP, EPH, PHA, HAN and ANT. The working principle of Ngram indexing is shown in Figure 5.

Figure 5. The working principle of the Ngram index, with an example.

The flow chart of the Ngram index algorithm is shown in Figure 6. First, the word is collected from the user query. The word is then divided according to the Ngram size, and a fuzzy match is performed on the Ngrams. After that, the Ngram indexing process starts with the specified position number of the word, and the distance is computed by the distance ranking process. Finally, the sorted array of words is fetched from the exact matches and the fuzzy matches processed by the Ngram indexing method.

Figure 6. The flow chart of the Ngram index algorithm.

2.3 Search Engine Full Text Index

A technique is needed to search the documents that a computer stores in a full text database of an information retrieval system; this technique is formally known as the full text index or full text search. Full text search is distinguished from searches based on metadata or on parts of the original texts represented in the database. The working principle of full text search is quite different: it first finds all of the words in every document and then tries to match them against the search condition or specification defined by the user. In the 1990s full text indexing was most popular in online bibliographic databases. Many websites and applications, such as word processing software, support full text search. Generalized search engine full text indices have two major parts, each of which is important for optimizing the search process: the document index and the word lists. How a full text index is created and works in an information retrieval system is shown in Figure 7.

3. Proposed Model

Existing index data structures have some loopholes, and for that reason they cannot sufficiently handle the retrieval of information or documents in an information retrieval system. Existing models such as the suffix tree and the Ngram index are implemented with arrays or structures. In this paper we try to implement a better indexing model, or data structure, that helps to fetch good query results according to the concept of the query as fast as possible.

3.1 The Architecture of a Search Engine using the Proposed Indexing Technique

The architecture of a search engine depends on many elements, because it combines many systems such as the indexer, the crawler or knowledge graph, the pre-processor, the domain dictionary and the query processor. This section discusses the information retrieval process using the S-N structured index, the indexing technique proposed in this paper. First, information is collected from the World Wide Web (WWW) by crawler or spider technology. Nowadays this is also done with a knowledge graph for faster retrieval. After the documents have been fetched, the pre-processor extracts the keywords from each document.
Keywords are taken as the metadata of a concept. Then, according to the concept of the keyword, a domain dictionary or WordNet is linked in order to fetch the documents through the concept based index. When a query is submitted through the user interface, it is processed by the query processor. The S-N structured index is then applied to retrieve the documents according to the concept of the query. Finally, the search results are returned to the user. The architecture of a search engine using the proposed indexing technique is shown in Figure 8. The working principle of the new indexing technique, named the S-N structured index, is described in Section 3.2.

3.2 S-N Structured Index

The proposed work mainly concerns index data structures. Many index data structures already exist for information retrieval. The suffix tree index, Ngram index, inverted index, citation index and term document matrix are the ones mainly used in search engines and other information retrieval systems. However, these indexing processes are very complex and very time consuming. As an improvement, we introduce a new index data structure named the S-N structured index. As the name suggests, it is a combination of two existing index data structures: the suffix tree index and the Ngram index. The data structure also uses the full text search architecture, but in a different way. A full text index has two main indices, the document index and the word lists. In the S-N structured index these two indices are created by the suffix tree index and the Ngram index. The main word list is created with the Ngram index technique, which divides the words into domains according to concept. The document index is created with the suffix tree index as a link tree document structure, which helps to speed up the search process.
After the link tree document structure has been created, every node is stored in the relational database. When the user submits a query, the S-N structured index checks the domain and its metadata and, according to the concept, retrieves the documents from the relational database.

Steps to build the S-N structured index:
a. First, create the word list with the Ngram index technique. Suppose the word is BANANA; it can be divided into any form, such as shingles, digrams or trigrams. In our example the word is divided into digrams, so the digrams are BA, AN, NA, AN, NA.
b. After creating the digrams, construct the word list, divide it into domains according to the concept of each word, and arrange it in alphabetical order.
c. After arranging the words into domains, apply the suffix tree index to create the document index for actual query processing according to the user requirement.
d. While creating the suffix tree index, build the link tree document structure from the metadata and the concept of the domain knowledge of the documents, for a better indexing process.
e. After creating the link tree document structure, store it in the relational database for query fetching and information retrieval.

Figure 7. Generalized structures of search engine full text indices.

Figure 8. The architecture of a search engine using the proposed indexing technique.

Example: Let us take an example to better understand the working principle of the S-N structured index technique.

Step 1: First, create the word lists with the Ngram index technique. In our example we take the word BANANA and divide it into digrams, which are shown in Figure 9.

Figure 9. Digrams of the word BANANA.

Step 2: After the digrams have been created by the Ngram index, the domain needs to be specified according to the concept of the word. Once the concept of the word has been specified, the domain is created and arranged in alphabetical order. We take the BA digram to elaborate the S-N structured index. Finally, the word list is created by the Ngram index as shown in Figure 10. The same procedure is applied to the other digrams, AN, NA, AN and NA. The creation of the S-N structured index is shown in Figure 12.

Figure 10. Domain and word list created by the Ngram index for the BA digram.
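Steps 1 and 2 of the example can be sketched as follows. This is only one possible reading of the word list construction, under stated assumptions: the text does not specify how a word is assigned to a domain, so the sketch uses a small hand-labelled vocabulary (SAMPLE_VOCABULARY, with hypothetical words and domain names), and the function names digrams and build_word_list are likewise illustrative.

```python
from collections import defaultdict


def digrams(word):
    """Split a word into its contiguous two-letter Ngrams."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_word_list(vocabulary):
    """Group words by digram, then by domain, both in alphabetical order.

    vocabulary maps each word to its domain (concept) label.
    """
    index = defaultdict(lambda: defaultdict(set))
    for word, domain in vocabulary.items():
        for gram in digrams(word):
            index[gram][domain].add(word)
    return {
        gram: {domain: sorted(words) for domain, words in sorted(domains.items())}
        for gram, domains in sorted(index.items())
    }


# Hypothetical sample data: words and domain labels are assumptions.
SAMPLE_VOCABULARY = {
    "BANANA": "fruit",
    "BANYAN": "tree",
    "BAOBAB": "tree",
    "BASIL": "herb",
}

word_list = build_word_list(SAMPLE_VOCABULARY)
print(digrams("BANANA"))
print(word_list["BA"])
```

Using sets inside the index means a repeated digram, such as AN and NA in BANANA, lists each word only once per domain.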

Step 3: After the word list has been created according to the domain, the next step uses the suffix tree index. The suffix tree index is used here to build the document index, which gives a better arrangement of the documents. In the S-N structured index technique the suffix tree is used to link the documents with the word list. After the documents have been linked with the words, a link tree document structure is created from the metadata and the concept of the domain knowledge of the documents. The idea is that each metadata item has an offset, and the documents related to a metadata or sub metadata item are attached below it, together with the other sub metadata. Below each metadata or sub metadata item, the list of attached documents is labelled (i, j), where i is the number of the metadata item, referred to as the offset, and j is the level number, which reflects how closely the concept of the word is related to the domain of the metadata. The S-N structured index creates the (i, j) pairs, which help to search efficiently for a particular concept. The link tree document is shown in Figure 11, where the document list index is created by the suffix tree index; this is also called the link tree document structure. In this figure (i, j) follows the suffix tree index: i is the offset and j is the level number.

Figure 11. Link tree document.

Step 4: After the document index list has been created, we have the S-N structured index, which is the combination of the Ngram index and the suffix tree index. The whole tree-like structure is stored in the relational database to improve the information retrieval process of any kind of search engine or information retrieval system. The S-N structured index is shown in Figure 12.

Figure 12. The S-N structured index.

4. Experimental Result

Searching with the new S-N indexing performs better than searching with existing indexing structures, in the sense that it retrieves more relevant documents. Figure 13 shows that the accuracy of search results using the S-N indexing method is 5% better than that of existing search engine indexing techniques.

Figure 13. Comparison of accuracy of search engine results.

5. Conclusion

In any information retrieval system or search engine, the indexing method is one of the key features of the retrieval process. Information can only be arranged and retrieved through the search index, so indexing methods are the most important part of any retrieval system. As discussed earlier, many index data structures exist, but they are not sufficient to retrieve information or documents according to the user's requirements or query. They often return unwanted search results that the user did not ask for. To avoid this situation we introduced a new indexing method that is able to retrieve information or documents according to the concept of the query or search. The S-N structured index mainly emphasizes the concept of the information or search elements being retrieved. For this reason it retrieves generally better search results than the other indexing methods or index data structures.

6. References

1. Al sayyed rizik MH, Al-Fawwaz B, Al-Adwan O, Hussam FN, Al-Mohannad KS. Search engines in website security leak. World Applied Sciences Journal. 2012; 20(5).
2. Soumen S, Roshni R, Shriti S, Ritika R, Paulami G. A new approach to concept based document clustering and comparative study with hierarchical clustering. International Journal of Computer Engineering and Applications. Apr; 140(7).
3. Vijaya KPN, Raghunatha RV. A practical approach to working of web search engine. International Journal of Computer and Electronics Research. Feb; 2(1).
4. Satinder B, Rajender N. A comparative study of traditional search engines with the metasearch engines. Ultra Scientist. Apr; 21(2).
5. Soumen S, Sangita K. Concept based categorization of documents for search engines. International Journal of Research in Engineering and Technology. Oct; 4(10).
6. Milos I, Petar S, Mladen V. Suffix tree clustering-data mining algorithm, ERK. 2014.
7. Sudhakar P, Poonkuzhali G, kumar R, Kishore. Content based ranking for search engines. Proceedings of International Multi Conference of Engineers and Computer Scientists, Hongkong. Mar; 1.
8. Mudgil P, Sharma AK, Gupta P. An improved indexing mechanism to index web documents, IEEE. 2013.
9. Willett P. Recent trends in hierarchic document clustering: A critical review. Information Processing and Management. Jan; 24(5).
10. Sowmya R, Neeraja G, Vandita R. Search engines using evolutionary algorithms. International Journal of Communication Network Security. 2012; 1(4).
11. Soumen S, Anjali T, Debapriya M, Debopriya P, Moutrisha P, Sreyashi R. Enhanced model of web page prediction using page rank and markov model. International Journal of Computer Application. Apr; 140(7).
