Generalized indexing and keyword search using User Log


1 Yogini Dingorkar, 2 S. Mohan Kumar, 3 Ankush Maind
1 M. Tech Scholar, 2 Coordinator, 3 Assistant Professor
Department of Computer Science and Engineering, Tulsiramji Gaikwad-Patil College of Engineering & Technology, Nagpur, India.
1 yoginibangde@gmail.com, 2 tgpcet.mtech@gmail.com, 3 ankushmaind@gmail.com

Abstract: Databases contain huge amounts of data, which must be stored efficiently so that it can be retrieved in less time. Various techniques exist for storing data properly. Compression of data reduces both the time needed to evaluate queries and the memory required to store the data. Many available techniques compress data but require decompression when retrieving it, which increases the time complexity. Our system is based on indexing large structured data in order to reduce time and space requirements. We apply natural language processing to queries as well as to data to extract keywords. We then apply an algorithm based on the intersection operation, which works on intervals of indexes. To reduce the number of intervals, reordering algorithms can also be applied. We also use the concept of logs, which is useful when retrieving data with queries. This paper gives a comprehensive overview of the proposed system and explains the compression of indexes using intervals.

I. INTRODUCTION

A significant amount of the world's enterprise data resides in relational databases. It is important that users be able to seamlessly search and browse the information stored in these databases as well. The primary focus of designers of computing systems and data mining has been the improvement of system performance. In line with this objective, performance has grown steadily, driven by more efficient system design and by improvements in the complexity of the system.
Our proposed system is based on compression of data using intervals of indexes. Whenever data is stored, an index file is generated that contains the lexicon, the indexes, and the frequency of each word in the database. Efficient indexing is required when storing data in order to improve search performance. Searching is done through queries, and queries can be processed quickly only if the data is properly stored and managed. To improve search performance further, we create user logs to discover users' frequent patterns. Besides searching, various compression and reordering techniques are available that require less memory and time. To generate the indexes, our system first finds the keywords by applying stemming and removing stop words; once the keywords are identified, the indexes are generated. Reordering algorithms then produce sequential indexes with fewer intervals.

Different search techniques use union and intersection operations to find the results of queries; these operations correspond to OR-query and AND-query semantics. In [1], Hao Wu, Guoliang Li, and Lizhu Zhou present the SCANLINEUNION+ and PROBISECT+ algorithms, of which PROBISECT+ works better for searching because it is faster and avoids unnecessary probes. In the proposed technique we use the PROBISECT+ algorithm to intersect the actual data with the keywords present in the queries so that the exact result can be obtained.

For compression of data, different encoding schemes are available, such as variable byte encoding (VBE) [10], which is reported to be 2x faster than variable bit encoding. It is a very simple byte-wise compression scheme: it uses 7 bits of each byte to code the data portion and reserves the most significant bit as a flag that indicates whether the next byte is still part of the current value. VBE compression also makes transferring data from memory to the CPU cheaper than transferring uncompressed data.
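The byte layout described above can be illustrated with a minimal sketch. It is not the implementation evaluated in [10]; it assumes that the 7-bit groups are written least significant group first and that a set flag bit means "another byte of the same value follows". The function names vb_encode and vb_decode are chosen here for illustration only.

```python
from typing import List


def vb_encode(n: int) -> bytes:
    """Encode one non-negative integer using 7 data bits per byte."""
    if n < 0:
        raise ValueError("variable byte encoding needs a non-negative integer")
    out = bytearray()
    while True:
        low7 = n & 0x7F          # data portion: the lowest 7 bits
        n >>= 7
        if n:
            # Assumed convention: flag bit set means more bytes follow.
            out.append(low7 | 0x80)
        else:
            out.append(low7)     # flag bit clear: last byte of this value
            return bytes(out)


def vb_decode(data: bytes) -> List[int]:
    """Decode a concatenation of variable-byte encoded integers."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:          # value continues in the next byte
            shift += 7
        else:                    # value is complete
            values.append(current)
            current, shift = 0, 0
    return values
```

Under these assumptions, values below 128 occupy a single byte, which is why such a scheme pays off when the stored numbers (for example, gaps between consecutive IDs) are small.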
The PForDelta encoding [3,7] classifies the values of an inverted list as either coded values or exception values. Exception values are stored in uncompressed form, but their slots are still kept in the corresponding positions; coded values are assigned an arbitrary bit width b, which is kept constant within a disk block. The inverted list is divided into blocks.

The proposed system uses reordering techniques, which order the data so that the generated lists contain fewer intervals and therefore require less storage space. Shieh et al. [9] proposed a graph-based DocID reassignment algorithm that adopts a Travelling Salesman Problem (TSP) heuristic. Blandford and Blelloch [11] proposed an algorithm called B&B, which permutes the document identifiers in order to enhance the clustering property of posting lists. This algorithm creates a similarity graph G from the inverted file (IF) index: each document is a vertex of the graph, and the edges are weighted by the cosine similarity between each pair of documents. The graph G is then recursively split into smaller subgraphs until each subgraph is a singleton, and a depth-first traversal of the resulting tree reassigns the DocIDs. Silvestri [5] shows that, for collections of Web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. The SIGSORT algorithm [1] works on signatures of words: a summary of each document is generated, with the words arranged in descending order of their frequencies. SIGSORT is more suitable for structured and short text data, can handle large data, and provides higher clustering power.

The remainder of the paper is organized as follows. Section 2 gives an overview of the proposed system. Section 3 details the use of natural language processing in the proposed system. Section 4 describes how the intersection algorithm works in the proposed system. Section 5 describes the generation of logs. Section 6 describes the reordering technique applied to improve the sequence of indexes. Section 7 shows the experimental results. The paper concludes in Section 8.

II. OUTLINE OF THE PROPOSED METHOD

The diagram shows how the proposed system works. The user sends a query, which is passed to natural language processing (NLP) to identify the keywords. These keywords are then searched in the index table, which is created using intervals; at the same time, the logs are checked for related data to reduce the search time. When storing the data in index form and interval form, the indexer first applies natural language processing to the whole data to identify the keywords. Based on the keyword positions, indexes are assigned to the keywords.
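The indexing step just described can be sketched as follows. This is only an illustration under stated assumptions: records are numbered from 1 in storage order, keywords are lower-cased alphanumeric tokens, stemming is omitted, and STOP_WORDS is a small hypothetical list rather than the one used by the authors. The names extract_keywords and build_id_table are likewise hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "which", "a", "an", "of", "in", "to", "and"}


def extract_keywords(text: str) -> List[str]:
    """Lower-case the text, split it into tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_id_table(records: List[str]) -> Dict[str, List[int]]:
    """Map each keyword to the ascending list of record IDs containing it."""
    table = defaultdict(list)
    for record_id, text in enumerate(records, start=1):
        # set() ensures a record ID is listed once per keyword.
        for keyword in set(extract_keywords(text)):
            table[keyword].append(record_id)
    return dict(table)
```

Because records are visited in ID order, every ID list comes out sorted, which is the form needed before runs of consecutive IDs can be stored as intervals.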
After the keywords are assigned, the ID table is formed, and from those IDs we find the intervals, so that the indexes can be stored in interval form. To generate proper, sequential indexes, the reordering algorithm is applied to the ID table. Reassigning the document identifiers of the original collection lowers the distance between the positions of documents.

Steps for execution
1: Identify keywords from the document using NLP techniques
2: Assign the indexes for each keyword
3: Reorder the documents using SIGSORT and TSP
4: Generate the index file by considering intervals of indexes
5: Prepare and update the log file of users after each activity
6: When a query is fired, check for the result in the log file
7: If the result is not present in the log file, then search for the result in the index file; else go to step 5
8: Return the result

III. IDENTIFYING KEYWORDS

In the proposed system we use natural language processing to identify the actual meaningful words in the data and in the query. We apply stemming to the data and the queries to find the root form of the words. If a word ends with ed, ing, or ly, the stemming process reduces the inflected or derived word to its stem or root; for example, interfaced and interfacing are converted to interface. We also filter out the stop words from the query and the data. Stop words are words that are filtered out prior to, or after, processing of natural language data. We remove words such as the, is, and which, consider only the remaining keywords, and assign IDs to them.

Figure 2.1: Outline of the proposed system

IV. PREPARING LOGS

In the proposed system we further investigate the issue of developing a high-quality and effective IR system by incorporating logs while processing the query, which makes it possible to create and manage search logs from information recorded by previous searches. The search technique stores raw search logs, from which it generates user-requested search log reports. Log files contain the user name, time stamp, access request, and result status. The log files are maintained by the system. Analyzing these log files gives a clear picture of user behavior. Log generation is performed using the following steps.

Algorithm for log generation
Creation of user log
Step 1: Enter user_id and password
Step 2: If user_id and password match, go to step 3; else enter user validation again
Step 4: Create user session_id
Step 5: while (session_id)
Step 6: Monitor activity of user
Step 7: Update database of user
Step 8: end while
Step 9: end procedure

V. PROBE BASED ALGORITHM

The probe based algorithm is based on the intersection operation. Because the keywords in queries are of different lengths, the probe based algorithm is well suited to retrieving and storing the data. The following explains how intersection works on a set of indexes.

Definition: Given a set of interval lists, $R = \{R_1, R_2, \ldots, R_n\}$, and their equivalent ID lists, $S = \{S_1, S_2, \ldots, S_n\}$, the intersection of R is the equivalent interval list of $\bigcap_{k=1}^{n} S_k$.

For example, suppose we have the interval lists {[5,8], [12,14]}, {[6,8], [13,16]}, and {[4,9], [14,14], [16,25]}. Their equivalent ID lists are {5,6,7,8,12,13,14}, {6,7,8,13,14,15,16}, and {4,5,6,7,8,9,14,16,17,18,19,20,21,22,23,24,25}. The intersection of the ID lists is {6,7,8,14}, which corresponds to the interval list {[6,8], [14,14]}.

The proposed system works on probes, and this algorithm is faster for query-based keyword search. The probe based approach is more efficient than a sequential scan. The probe based algorithm uses binary search, with complexity O(log m), to find the keywords and avoids unnecessary probes by calling the function recursively.
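The definition and the worked example can be checked with a short sketch. Note that this is a plain two-pointer sweep over sorted interval lists, not the PROBISECT+ algorithm of [1]; it assumes each input list is sorted by lower bound and contains non-overlapping intervals, and the function names are chosen here for illustration.

```python
from functools import reduce
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # inclusive (lower bound, upper bound)


def ids_to_intervals(ids: Iterable[int]) -> List[Interval]:
    """Collapse runs of consecutive IDs into inclusive intervals."""
    intervals: List[Interval] = []
    for i in sorted(set(ids)):
        if intervals and i == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], i)   # extend the current run
        else:
            intervals.append((i, i))                # start a new run
    return intervals


def intersect_two(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            if result and result[-1][1] + 1 == lo:
                result[-1] = (result[-1][0], hi)    # merge adjacent pieces
            else:
                result.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(lists: List[List[Interval]]) -> List[Interval]:
    """Intersect any number of interval lists by folding pairwise."""
    if not lists:
        return []
    return reduce(intersect_two, lists)
```

Applied to the three interval lists of the example, intersect_all returns [(6, 8), (14, 14)], matching the result stated above without ever expanding the intervals into individual IDs.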
Our approach uses the PROBISECT+ algorithm [1], whose complexity is given as C P = O(min(log n Σ K J R K }). The probe based algorithm takes R as a set of interval lists and sorts R in ascending order of lower bounds. The PROBISECT+ algorithm uses the intersection operation and calculates the intersection list of a set of ordered lists. It probes the ordered lists sequentially and terminates unpromising probes. The probing function is called recursively to avoid empty and unpromising probes.

Reordering data

Reordering of data is necessary for generating the best order of the documents. If the data is reordered so as to generate sequential indexes, the memory requirement is reduced and searching also improves. Reordering algorithms are used to find an ordering of the documents in which similar documents stay near each other. Silvestri [5] suggested a technique in which web pages are arranged according to their URLs. A similar concept is used to sort the documents according to their summaries so that similar documents are kept near each other. For sorting, summaries can be generated as follows. First, all the words are sorted in descending order of their frequencies. Then, the top n (e.g., n = 1000) most frequent words are chosen as the signature vocabulary. For each document, a string, called a signature, is generated by choosing the words that belong to the signature vocabulary and sorting them in descending order of their frequencies. The document sorting compares each pair of signatures word-wise instead of letter-wise. In the proposed approach, the signature sorting algorithm is used to sort the documents according to their similarity, and TSP is used to identify the documents with similar signatures.

Experimental Results

The experimental results compare the performance of indexing with that of indexing with intervals, as shown in Table a and Figure a.
Various queries were executed for temporal analysis, and some of them are listed in the table; they show that performance improves when efficient intervals are found. Figure a shows the single-query graph, in which we can clearly see that the time required for indexing is greater than that for indexing with intervals. The time required for traditional indexing is 4.32 ms, and indexing with intervals requires 3.90 ms.

Table a: Performance of indexing vs. indexing with intervals

Query                            Time required for indexing    Time required for indexing with intervals
what is interfaces
Dictionary in
how to use file class in
arrays and vector class in
how to use arrays in             2.32                          1.00
Hashtable in                     2.11                          1.61
How to use packages in           3.60                          2.90
use of linklist and stack in     5.2                           4.

Figure a: Indexing vs. indexing with intervals

Performance of indexing with intervals and log

The performance obtained by incorporating logs into indexing with intervals is shown in Table b and Figure b. The figure shows the single-query graph, in which we can clearly see that the time required for indexing with intervals is 3.79 ms and the time required for the implemented system is 1.60 ms, which is about half that of the existing system.

Figure b: Indexing using intervals vs. indexing with intervals and log

Table b: Performance of indexing with intervals and log (columns: query; time required for indexing with intervals; time required with logs). Queries: what is interfaces; Dictionary in; how to use file class in; arrays and vector class in; how to use arrays in; Hashtable in; How to use packages in; use of linklist and stack in.

VI. CONCLUSIONS

This paper presented a technique of indexing that works on intervals of indexes, which helps reduce the memory requirement, and that uses user logs, which helps reduce the retrieval time. The graphical comparison shows that the performance of traditional indexing is improved by the concept of intervals of indexes; the extended concept using logs shows that the time required for retrieval is reduced to about half compared with the existing system. In this approach, searching uses the PROBISECT+ algorithm, which is based on intersection operations, to find the results of queries. The reordering technique is applied to reduce the number of intervals and to generate a sequence of indexes, which yields efficient intervals and reduces the memory required to store the indexes. Along with indexing, the use of user logs while searching greatly improves performance.

REFERENCES

[1] Hao Wu, Guoliang Li, and Lizhu Zhou, Ginix: Generalized inverted index for keyword search, IEEE Transactions on Knowledge and Data Mining, vol. 8, no. 1, 2013.
[2] Vijayashri Losarwar and Madhuri Joshi, Data preprocessing in web usage mining, International Conference on Artificial Intelligence and Embedded Systems (ICAIES 2012), Singapore, July 15-16, 2012.
[3] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava, Fast indexes and algorithms for set similarity selection queries, in Proc. of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008.
[4] J. Zhang, X. Long, and T. Suel, Performance of compressed inverted list caching in search engines, in Proc. of the 17th International Conference on World Wide Web, Beijing, China, 2008.
[5] F. Silvestri, Sorting out the document identifier assignment problem, in Proc. of the 29th European Conference on IR Research, Rome, Italy, 2007.
[6] R. Blanco and A. Barreiro, TSP and cluster-based solutions to the reassignment of document identifiers, Information Retrieval, vol. 9, no. 4.
[7] M. Zukowski, S. Héman, N. Nes, and P. A. Boncz, Super-scalar RAM-CPU cache compression, in Proc. of the 22nd International Conference on Data Engineering, Atlanta, Georgia, USA, 2006, pp. 59.
[8] J. Zobel and A. Moffat, Inverted files for text search engines, ACM Computing Surveys, vol. 38, no. 2, pp. 6.
[9] Wann-Yun Shieh, Tien-Fu Chen, Jean Jyh-Jiun Shann, and Chung-Ping Chung, Inverted file compression through document identifier reassignment, Information Processing and Management, 39(1), January.
[10] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel, Compression of inverted indexes for fast query evaluation, in Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002.
[11] Dan Blandford and Guy Blelloch, Index compression through document reordering, in Proceedings of the Data Compression Conference (DCC '02), Washington, DC, USA, IEEE Computer Society.
[12] P. Elias, Universal codeword sets and representations of the integers, IEEE Transactions on Information Theory, IT-21(2):194-203, Mar.
[13] S. Golomb, Run-length encodings, IEEE Transactions on Information Theory, IT-12(3):399-401, July.
