A. Krishna Mohan *1, Harika Yelisala #1, MHM Krishna Prasad #2


Vol. 2, Issue 3, May-Jun 2012, pp. 1433-1438
IR Tree: An Adept Index for Handling Geographic Document Searching
A. Krishna Mohan *1, Harika Yelisala #1, MHM Krishna Prasad #2
*1,#2 Associate Professor, Dept. of Computer Science, Jawaharlal Nehru Technological University Kakinada, Andhra Pradesh.
#1 MTech, Department of Computer Science, JNTU Kakinada, Andhra Pradesh.
Abstract
When a location based query is submitted to a geographic search engine, the engine retrieves documents that are both textually and spatially relevant. After retrieval, the search engine ranks the documents by their combined textual and spatial relevance to the query. Both textual and spatial information must therefore be indexed. Existing systems cannot handle the textual and spatial aspects simultaneously, so an adept index that can handle both at once is needed. In this paper, we propose an efficient index called IR Tree, which indexes textual and spatial information in an integrated way. IR Tree, along with a Top K document retrieval algorithm, performs spatial and textual filtering, relevance computation, and document ranking, and it takes document length into account when computing relevance.
Keywords- Geographic document searching, indexing, IR Tree, Top-K document retrieval.
1. INTRODUCTION
The Internet is the best option for gathering information these days. Information is stored in the form of documents, so a large number of documents must be stored. Without an index, a search engine has to check all the documents one by one even to access a few of them, so an efficient search engine is needed. A search engine is said to be efficient if it retrieves highly relevant documents with short latency. To do that, an adept index is needed. An index is a data structure that improves the speed of data retrieval operations on a database table. It helps in efficient retrieval of, and access to, the required documents.
Location based queries are queries that are related to some location. They are used in many ways, such as finding the weather in a particular location or finding tourist spots. As another example, an interviewer who wants to check the capability of an applicant may give an unfamiliar address as the interview venue, so the applicant searches for the location along with the name of the company. Let the venue be near Roseburg hotel, Oregon, and let the company name be ABC. The applicant can search the web to get the information. Some existing search engines retrieve documents related to Roseburg, some documents related to hotels, and some related to Oregon. To return the correct information, the search engine must take all the keywords (Roseburg, hotel, Oregon) into consideration together, so that the applicant gets correct and relevant information.
Example 1. Let there be 6 documents or web pages on the web server, W = {W1, W2, ..., W6}. These are located close to Oregon. Each web page also contains some textual words. The frequencies of three words are shown in fig 1. The problem is to find the documents that are most relevant. A document is said to be relevant if at least one keyword matches and its location overlaps the given query location. The more textual words match, provided that the document location overlaps or lies within the query spatial scope, the more relevant the document is. Of all the documents, W1 is the most related to the given query because it has the highest frequency of matching words. So W1 is retrieved first and then W5 (if the number of documents to be retrieved is 2). If the documents are large in number, an efficient index is needed. In this paper, we handle spatial relevance, textual relevance, and ranking in an adept way.
Figure 1: Representation of document spatial distribution and frequency of each query word in the document set
2. RELATED WORK
There are two types of existing systems that can handle location based queries.
2.1. Individual index structures
Here two indexing structures are needed, one for the spatial aspect and one for the textual aspect [1], [2]. In query processing, the documents are first filtered using the textual index, and the textually filtered documents are then filtered using the spatial index. The relevance of the intermediate results is then computed from the joint textual and spatial aspects. The main disadvantage of this kind of indexing structure is that the intermediate results obtained between textual and spatial filtering may contain very large sets of documents that do not lie within the required spatial scope. The processing time and the memory needed to store those intermediate results are costly. So hybrid index structures came into existence.


Building two indexing structures and implementing relevance computation is also expensive.
2.2. Hybrid Index Structures
These are indexing structures in which spatial and textual aspects are considered simultaneously. One index is built over the other, so the search for textual aspects is done based on the spatial aspects or vice versa. In some hybrid indexes the location index and the textual index are combined into a new word that forms a new index [1], [2]. These cannot be used by many spatial applications, and memory for the newly created index words is also needed. Some hybrid indexes, such as the KR* tree, combine an R* tree with an inverted file structure by attaching to each node of the tree the list of all the textual keywords present in the objects of its subtree.
3. PROBLEM DEFINITION
Documents here are web pages. Let the web pages be denoted as W = {W1, W2, ..., Wn}, a set of n pages. Each page consists of a set of textual keywords Tw and a set of spatial locations Sw. A location based query specifies a set of textual keywords Tq and a spatial scope Sq. In this paper a query is represented as Q (Tq, Sq) and a web page as W (Tw, Sw).
General Textual Relevance computation: A web page w is said to be textually relevant to a query Q (Tq, Sq) if all or some of the textual keywords of the query are present in the web page.
Spatial Relevance computation: A location in the document w is said to be spatially related to the query location Sq only if the document location Sw overlaps the query location Sq, either completely or partially.
Geographic Web page searching: Searching is finding the documents that are both textually and spatially related to a given query. Technically, a geographic document search engine identifies the web pages in the web page set W that are both textually and spatially relevant to the query Q (Tq, Sq).
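The two relevance tests defined above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a spatial scope is an axis-aligned rectangle and that keywords are compared as lowercase strings. The names `Rect`, `overlaps`, `is_textually_relevant`, and `is_candidate`, as well as the sample pages and coordinates, are hypothetical and do not come from the paper.

```python
from typing import NamedTuple, Set


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a spatial scope (assumed representation)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the rectangles overlap completely or partially (touching edges count)."""
    return (a.x_min <= b.x_max and b.x_min <= a.x_max
            and a.y_min <= b.y_max and b.y_min <= a.y_max)


def is_textually_relevant(page_words: Set[str], query_words: Set[str]) -> bool:
    """True if at least one query keyword occurs in the page."""
    return bool(page_words & query_words)


def is_candidate(page_words: Set[str], page_scope: Rect,
                 query_words: Set[str], query_scope: Rect) -> bool:
    """A page qualifies only if it is both textually and spatially relevant."""
    return (is_textually_relevant(page_words, query_words)
            and overlaps(page_scope, query_scope))


# Hypothetical query and pages, loosely modelled on the interview example.
query_words = {"roseburg", "hotel", "oregon"}
query_scope = Rect(0.0, 0.0, 10.0, 10.0)
w1 = ({"roseburg", "hotel", "abc"}, Rect(2.0, 2.0, 4.0, 4.0))  # matches on both aspects
w2 = ({"hotel", "oregon"}, Rect(20.0, 20.0, 25.0, 25.0))       # text matches, location does not

w1_ok = is_candidate(*w1, query_words, query_scope)
w2_ok = is_candidate(*w2, query_words, query_scope)
print(w1_ok, w2_ok)
```

In this sketch the second page is rejected even though two of its keywords match, which is the behaviour the definition of geographic web page searching asks for.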
Geographic Webpage Ranking: Ranking the geographic documents in some order is called Geographic Webpage Ranking. The user is interested in only the top K web pages, so the documents are retrieved in descending order of relevance. As the textual and spatial aspects are different and each has its own importance, they are assigned weights in the relevance computation. It is represented as:
\[
R(W) =
\begin{cases}
\delta \cdot T(W) + (1-\delta) \cdot S(W) & \text{if } T(W) > 0 \text{ and } S(W) > 0 \\
0 & \text{otherwise}
\end{cases}
\]
The relevance of a web page is thus the weighted sum of its textual relevance T(W) and its spatial relevance S(W), where δ is a parameter that manages the relative weights.
Document length: Documents vary in length. Two documents may both be relevant to a given query while one is very small and the other very large. The larger one may contain more information, yet the smaller one may be the more relevant. So document length must be considered in computing the relevance.
4. IR TREE
IR tree is a tree data structure used as an index to handle location based queries. It is designed so that it performs spatial clustering first and then textual filtering. Spatial filtering is done first so that the search space is reduced, because many documents may be textually related while only a few of them lie within the spatial scope. Textual filtering is then done to reduce the search cost. Finally, joint relevance computation and ranking are done simultaneously, so that the search stops as soon as the top k documents (k being the number of documents to be retrieved) are obtained. As a design issue, the index structure must be designed carefully because each textual word in the documents is treated as a dimension, so the document space has very high dimensionality.
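As a rough illustration of the weighted relevance R(W) and of a search that stops once k documents have been output, the following Python sketch runs a best-first traversal over a small hand-built tree. It is a simplification under stated assumptions, not the paper's algorithm: each inner node is assumed to carry upper bounds on the textual and spatial scores of its subtree, whereas the IR tree derives node relevance from its stored summaries. The names `Node`, `relevance`, and `top_k`, and all scores, are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


def relevance(textual: float, spatial: float, delta: float) -> float:
    """R(W) = delta*T(W) + (1-delta)*S(W) if both parts are positive, else 0."""
    if textual > 0 and spatial > 0:
        return delta * textual + (1 - delta) * spatial
    return 0.0


@dataclass
class Node:
    name: str
    textual: float  # T score; for an inner node, an assumed upper bound over its subtree
    spatial: float  # S score; same assumption for inner nodes
    children: List["Node"] = field(default_factory=list)


def top_k(root: Node, delta: float, k: int) -> List[Tuple[str, float]]:
    """Best-first search that stops as soon as k documents have been output."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # so that Node objects are never compared with each other.
    heap = [(-relevance(root.textual, root.spatial, delta), 0, root)]
    counter = 1
    results: List[Tuple[str, float]] = []
    while heap and len(results) < k:
        neg_score, _, node = heapq.heappop(heap)
        if not node.children:        # a document (leaf of this toy tree)
            if neg_score < 0:        # zero relevance means "not relevant"
                results.append((node.name, -neg_score))
            continue
        for child in node.children:  # expand an inner node
            score = relevance(child.textual, child.spatial, delta)
            heapq.heappush(heap, (-score, counter, child))
            counter += 1
    return results


# Hypothetical tree: two regions, four documents (scores are made up).
region_a = Node("A", 0.75, 0.5, [Node("W1", 0.75, 0.5), Node("W2", 0.5, 0.25)])
region_b = Node("B", 1.0, 0.25, [Node("W3", 1.0, 0.0), Node("W4", 0.25, 0.25)])
root = Node("root", 1.0, 1.0, [region_a, region_b])

results = top_k(root, delta=0.5, k=2)
print(results)
```

Because every inner node scores at least as high as anything below it, a document popped from the queue cannot be beaten by an unexplored one, so no separate ranking pass is needed. W3 is never returned: its spatial score is 0, so R(W3) = 0 despite its high textual score.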
In addition, spatial locations and textual words have their own representations and measurements, so the index must integrate these two aspects in a compatible way. Our IR Tree is designed to perform spatial filtering, textual filtering, relevance computation, and ranking simultaneously. Storage and access overheads are also considered.
IR Tree Structure: IR tree is designed in such a way that it clusters documents spatially and abstracts their textual content at various granularities [1]. All the spatially related documents are clustered so that any document that does not belong to the region requested by the user can be pruned immediately as unrelated. All textual words are represented using inverted files. Each node has a document précis, so that if a query keyword is present under that node, the search can continue through the node's pointers. IR tree is a collection of nodes. It consists of a root node, a few non leaf nodes, and a few leaf nodes.
Leaf nodes: Each leaf node is linked to an inverted file. Each inverted file consists of a list of words, and each word points to the list of documents that contain it. It can be represented as shown in fig 2.
Figure 2: Leaf node representation
Non leaf nodes: All the non leaf nodes contain a document précis. The document précis is a collection of information about the node's spatial region and the number of documents that come under that node. It also contains the WF and IWF values. It is shown in fig 3.
Figure 3: Non leaf node representation.
In brief, let the non leaf node be node i, which may have many child nodes. Its document précis contains:
1. Mi: the Minimal Bounding Box (MBB) that covers all the locations of the documents under node i, i.e., the smallest rectangular region that covers all the locations in the document set under node i.
2. |Wi|: the cardinality of the document set under node i, i.e., the number of documents that come under node i.
3. WF and IWF pair values: WF t,w is the Word Frequency, i.e., the frequency with which a word t occurs in a document w. IWF t,w is the Inverse Web page Frequency [6], which is derived from the number of documents in the document set W that contain one or more occurrences of the textual word t. This pair lets relevance be estimated just by checking the WF and IWF values, since a node need not be considered if its pair value is low.
IR Tree operations: The major operations performed on any tree structure are insertion, deletion, traversal, and update. First the tree has to be constructed, and for that a set of documents is needed. As mentioned earlier, IR tree first clusters all the documents spatially, i.e., all the spatially related locations are grouped into a set. IR tree construction proceeds bottom up. For example, as shown in fig 4, all the documents related to Los Angeles are grouped, the documents related to Oakley are grouped, and those groups are in turn grouped under California. Generalization is thus done during IR tree construction.
Figure 4: Spatial Clustering Representation.
Here we assume that every document is associated with one location.
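The leaf inverted files, the document précis, and one level of bottom-up grouping described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each document has exactly one location that is already geocoded to a box, documents are grouped only when their boxes are identical, and aggregated word counts stand in for the stored WF values. The names `Document`, `Entry`, `union_box`, `build_leaf_entries`, and `summarize`, and the sample data, are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Document:
    name: str
    words: List[str]
    box: Box  # the document's single location, already geocoded to a box


@dataclass
class Entry:
    box: Box                 # Mi: box covering every document under the entry
    doc_count: int           # |Wi|: number of documents under the entry
    word_freq: Counter       # aggregated word counts (stand-in for WF values)
    inverted: Dict[str, List[str]] = field(default_factory=dict)  # leaf entries only


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that covers all the given boxes."""
    x0, y0, x1, y1 = zip(*boxes)
    return (min(x0), min(y0), max(x1), max(y1))


def build_leaf_entries(docs: List[Document]) -> List[Entry]:
    """Group documents that share a location and build one inverted file per group."""
    groups: Dict[Box, List[Document]] = defaultdict(list)
    for doc in docs:
        groups[doc.box].append(doc)
    entries = []
    for box, members in groups.items():
        freq: Counter = Counter()
        inverted: Dict[str, List[str]] = defaultdict(list)
        for doc in members:
            freq.update(doc.words)
            for word in set(doc.words):   # each word points to the documents containing it
                inverted[word].append(doc.name)
        entries.append(Entry(box, len(members), freq, dict(inverted)))
    return entries


def summarize(entries: List[Entry]) -> Entry:
    """Build the précis of a parent entry from its child entries."""
    total: Counter = Counter()
    for entry in entries:
        total.update(entry.word_freq)
    return Entry(union_box([e.box for e in entries]),
                 sum(e.doc_count for e in entries), total)


# Hypothetical documents: two share one location, the third lies elsewhere.
docs = [
    Document("W1", ["hotel", "roseburg", "hotel"], (0.0, 0.0, 1.0, 1.0)),
    Document("W2", ["hotel"], (0.0, 0.0, 1.0, 1.0)),
    Document("W3", ["oregon"], (2.0, 2.0, 3.0, 3.0)),
]
leaves = build_leaf_entries(docs)
parent = summarize(leaves)
print(parent.box, parent.doc_count, parent.word_freq["hotel"])
```

A full construction would repeat the grouping step level by level, subject to minimum and maximum fan-out limits, until a single root remains. The sketch stops after one level to keep the précis fields visible.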
Algorithm for IR tree construction:
Input: A document set, W; minimal node fan out, Min; maximal node fan out, Max;
Output: Root of IR tree.
Method:
1: Ni ← ∅
2: for each w ∈ W do
3: {
4: geocode w and represent Sw with MBB Mw;
5: if ∃ i ∈ Ni, Mi = Mw then
6: add w to i's document set Wi;
7: else
8: create new entry i;
9: set Mi ← Mw and also Wi ← {w};
10: Ni ← Ni ∪ {i};
11: }
12: for each i ∈ Ni do
13: {
14: construct inverted list for each word in the documents of i;
15: }
16: while |Ni| > Max do
17: {
18: cluster Ni based on Min and Max into nodes represented as new entries N'i;
19: prepare document précis for each i in N'i;
20: Ni ← N'i;
21: }
22: create root node for layer Ni and its document précis;
23: return root node as output;
In brief, (Line 1) the node set is first initialized to empty, and then (Lines 2-11) every web page in the web page set is geocoded and its location is represented using a Minimal Bounding Box. At the same time, spatially matching documents are clustered: if the document location matches an entry in the existing node set, the document is added to that entry; otherwise a new entry is created. (Lines 12-15) The inverted list is constructed for each word in each entry. After that, the nodes are arranged according to the maximum and minimum fan outs, and then the root is created and returned as output. Insertion and deletion involve updating the node sets and document précis, and are similar to the corresponding R tree operations.
Query Processing: In order to process the query, the WF and IWF values have to be calculated. The IWF calculation is explained in Algorithm 2. Finally, Top K document retrieval is explained in Algorithm 3.
Algorithm 2:
Input: Root of the tree R, Query (Tq, Sq)
Output: IWF values for each word in the query; set of nodes B;
Method:
1: set |Ws| = 0; WFt = 0 for each t in Tq;
2: push R to an empty stack K;
3: while K is not empty do
4: {
5: pop an entry i from K;
6: if Mi ∩ Sq ≠ ∅ then
7: if Mi ⊆ Sq then
8: |Ws| ← |Ws| + |Wi|;
9: if ∃ t ∈ Tq, WFt,w > 0 then
10: WFt ← WFt + WFt,w, ∀ t ∈ Tq;
11: B ← B ∪ {i};
12: end if
13: else
14: push all child entries of i to K;
15: end if;
16: end if;
17: end while;
18: return IWFt,w,sq = ln(1 + |Ws| / WFt);
This algorithm calculates the IWF value for each query keyword. First the root is pushed onto an empty stack, and entries are then popped one by one. An entry whose MBB overlaps the query spatial scope is examined: if it is covered by the scope, the documents under it are counted and the WF values are updated, and the entry is added to buffer B; otherwise its child entries are pushed onto the stack.
Algorithm 3: Top K document retrieval
Buffer B is passed to the Top K document retrieval algorithm so that all the nodes related to the query are available. The aim is to retrieve the k documents that are most spatially and textually related. Once k documents are retrieved the process must halt.
Input: interest set B; <WF, IWF> pairs; Q (Tq, Sq); δ; k value;
Output: the k retrieved documents C
Method:
1: #define R(i) = δ * Σ t ∈ Tq (WF * IWF / WL) + (1-δ) / dist(Mi, Sq); // consideration of document length
2: for each i ∈ B do
3: enqueue (i, R(i)) to Q; // all the entries in B are queued // end for
4: while Q is not empty do
5: dequeue an entry j from Q;
6: if j is a document then
7: C ← C ∪ {j};
8: if |C| = k then
9: goto 16; // end if
10: else if j is a leaf entry then
11: for each document w in j's inverted list with Sw ∩ Sq ≠ ∅, t ∈ Tq do
12: enqueue (w, R(w)) to Q; // end for
13: else
14: for each child g of j do
15: enqueue (g, R(g)) to Q; // end for, if, while
16: output C.
5. EXPERIMENTAL RESULTS
The experiments are based on two real data sets, LATimes 94 and Factiva. The data set LATimes 94 consists of 11,273 documents covering 2,119 locations. Its average number of words per document is 54 and the number of indexed document words is 9,986.
The total size is 421 MB. The other data set, Factiva, consists of 38,76 documents covering 47 locations. Its average number of words per document is 522 and the number of indexed document words is 13,286. The total size is 256 MB. First, all the location names are extracted from every document and these locations are geocoded into Minimal Bounding Boxes (MBB). Geocoding is based on an ontology that covers over 1, 29,784 worldwide locations. Considering factors such as area size and population size, locations are divided into small city, medium city, large city, state, and country, and the MBBs are constructed accordingly [3]. Experimental results are generated from the search time under different parameters. Search time is compared with KR* tree and Hybrid R. KR* tree is an index structure that combines the textual words with spatial objects in non leaf nodes, so that it supports textual and spatial filtering simultaneously, and then ranks based on joint relevance. Hybrid R is an index structure that filters documents spatially first and then textually, as an R tree is placed on top of the inverted files.
Searching Efficiency: Efficiency is estimated based on factors that each have their own impact on searching [4], [5]. The three most important factors are:
1. Size of the query spatial scope |S|;
2. Number of requested documents k;
3. Relative significance of textual relevance to spatial relevance.
Keeping two parameters constant and varying the third changes the search time. We now look at how the search time varies for the data sets mentioned above.
Impact of |S|: The query spatial scope size is one of the important factors that change the performance of searching. The values used for the query spatial scope (in km) are: 1 2, 2 2, 1 2, 5 2. As the spatial locations are represented as Minimal Bounding Boxes, they are measured as square areas.
So in our IR implementation the scope is limited from 1*1 km² to 5*5 km². First, keeping the number of documents k and the relative textual relevance δ constant, say k=1 (i.e., retrieval of the top 1 documents) and δ = 0.5, and varying the query spatial scope size, the results are obtained as per the range of the spatial scope. For the data sets Factiva and LATimes 94 the results are plotted in fig 5 and fig 6 respectively.
Figure 5: Effect of |S| on the data set Factiva.
Figure 6: Effect of |S| on the data set LATimes 94.
Impact of K: The number of documents to be retrieved, K, is an important factor that largely determines the search time. The k value can be 1, 3, 5, 1, 3..etc. Fixing |S| = 1*1 km² and δ to 0.5 and varying k for both data sets leads to the variations shown in fig 7 and fig 8. In both cases IR tree performs well when compared to KR* tree and Hybrid R. Of all the index structures, IR tree performs best and Hybrid R performs worst. Whatever the k value, IR Tree retrieved the documents efficiently.
Figure 7: Effect of K on data set Factiva.
Figure 8: Effect of K on data set LATimes 94.
Impact of δ: The third factor to be considered is the relative significance of textual and spatial relevance. Hybrid R and KR* tree retrieve the documents first and then rank them based on joint significance, whereas IR tree computes the relevance and ranks the documents simultaneously. However, as δ increases the search time also increases. Varying δ and fixing K and |S| to 1 and 1*1 km², we obtained the results plotted in fig 9 and fig 10 for the two data sets. Compared to Hybrid R and KR* tree, IR tree performs very well because of its storage structure and its top k document retrieval algorithm. A lot of time is saved by the incremental top k search algorithm, as there is no need to rank the documents after retrieval.
Figure 9: Effect of δ on data set LATimes 94.
Figure 10: Effect of δ on data set Factiva.
Even though the storage is slightly larger due to the presence of WF and IWF in internal nodes, this makes the search efficient, as it need not enter nodes that do not satisfy the location or the word.
6. CONCLUSION
In this paper, we proposed an efficient index structure, IR Tree, that handles textual filtering and spatial filtering together. We mainly focused on the consideration of document length in relevance computation. The top k documents are retrieved incrementally. The experiments showed that IR Tree is an adept index and performs well. Future work can extend it to account for semantic similarity between query keywords.

REFERENCES
[1]. Zhisheng Li, Ken C.K. Lee, Baihua Zheng, Wang-Chien Lee, Dik Lun Lee, Xufa Wang, IR Tree: An Efficient Index for Geographic Document Search, IEEE Vol 23, No 4, April 2011.
[2]. Khodaei A, Cyrus Shahabi, Chen Li, SKIF-P: A point based Indexing and Ranking of Web Documents for Spatial Keyword Search, Geoinformatica.
[3]. E. Amitay, N. Har'El, R. Sivan, A. Soffer, Web-A-Where: Geotagging Web Content, Proc. ACM SIGIR '04, pp
[4]. I.D. Felipe, V. Hristidis, N. Rishe, Keyword Search on Spatial Databases, Proc. IEEE 24th Int'l Conference Data Engg (ICDE '08), pp 656-665, 2008.
[5]. Khodaei A, Shahabi C, Li C (2010), Hybrid Indexing and Seamless Ranking of Spatial and Textual Features of Web Documents, in DEXA, pp
[6]. D. Hiemstra, A Probabilistic Justification for using TF IDF Term Weighting in Information Retrieval, Int'l J. Digital Libraries, Vol 3, No 2, pp
Harika Yelisala received her BTech from the Department of Computer Science, Acharya Nagarjuna University, in the year 2009. She is pursuing her MTech at Jawaharlal Nehru Technological University, Kakinada campus, during 2010-2012. Her research interests include Cloud Computing and Data Mining.
Krishna Mohan A is currently working as an Associate Professor in the Department of Computer Science at Jawaharlal Nehru Technological University, Kakinada. His research includes Data Mining.
Krishna Prasad MHM is currently working as an Associate Professor in the Department of Computer Science at Jawaharlal Nehru Technological University, Vijayanagaram. His research includes Data Mining.
