Index and Search for Hierarchical Key-value Data Store


Journal of Computational Information Systems 11: 4 (2015) 1179-1186

Index and Search for Hierarchical Key-value Data Store

Yongqing ZHU, Yang YU, Willie NG, Samsudin JUNIARTO
Data Storage Institute, A*STAR (Agency for Science, Technology and Research), Singapore

Abstract

Existing key-value stores support web applications with high scalability and availability. However, because data can be accessed only through primary keys, data retrieval from key-value stores is restricted. In this paper, we propose HierKV, a searchable key-value store with a hierarchical data model, to support fast data retrieval and search by secondary attributes. HierKV accelerates data retrieval by attributes via a hierarchical data model and index structure. A hierarchical TF-IDF ranking mechanism is proposed for HierKV to properly reflect keyword matches in different hierarchies. Experiments show that the proposed HierKV system achieves good search performance.

Keywords: Key-value Store; Data Retrieval; Hierarchy; Keyword Search; Ranking

1 Introduction

Key-value stores [1-3] have emerged in recent years to provide distributed data storage for large-scale web applications. The high scalability and availability offered by key-value stores are valuable properties that traditional database systems cannot provide. However, in most key-value store systems, data access is provided at the granularity of primary keys through the simple APIs get, put, and delete. Because modern applications need to retrieve data by attributes other than the primary key, it is necessary to index key-value data by secondary attributes and enable rich query features in key-value stores.

In many applications, data are correlated by attributes and the correlated data are normally retrieved together. A good example is online forums, where an original post can have multiple reply posts, and a reply post can have its own reply posts as well.
All these posts are correlated by the same topic and form a hierarchical tree, with the original post at the root and the reply posts at the intermediate and leaf nodes. Traditional database systems store these correlated posts as individual data records. When the posts are retrieved by topic, multiple disk reads are needed to read the individual records one by one. If the correlated posts were stored as a single data record, a single disk read would retrieve all of them, which is much faster. The correlated data therefore need to be enclosed properly in a single data record.

Corresponding author. Email address: yqzhu@dsi.a-star.edu.sg (Yongqing ZHU). Copyright 2015 Binary Information Press. DOI: /jcis13141. February 15, 2015.

This paper presents a high-performance searchable key-value store system, HierKV, which keeps data in a hierarchical data model and provides rich search features as well. HierKV groups the

correlated data together and encloses them in a single hierarchical data record, which speeds up data retrieval by attributes. An index structure is presented to index HierKV data, so that HierKV can provide quick search by secondary attributes. In addition, a hierarchical TF-IDF ranking mechanism is proposed for HierKV to rank the search results so that they properly reflect the different degrees of relevance of keyword matches in different hierarchies.

The rest of the paper is organized as follows: Part 2 describes the HierKV data model and API; Part 3 elaborates HierKV data retrieval, including index and search, as well as the novel ranking mechanism; Part 4 presents experiments and evaluation; and Part 5 concludes.

2 HierKV Data Model and API

Most key-value stores deploy either a ring-based architecture [2-4] or a tablet-based architecture [1, 5]. These systems generally operate via the basic key-value interfaces and access data by primary keys only. The proposed HierKV system provides not only the basic interfaces for key-value data access, but also a rich query interface for search by secondary attributes.

HierKV stores data as records in tables. Each record consists of a primary key as identifier and one or more secondary attributes. To accelerate data retrieval by attributes, HierKV groups the correlated data by the secondary attributes and encloses them within a single data record. We introduce a hierarchical structure to organize the value of a data record, since the value may contain different levels (hierarchies) of information, e.g. the different posts in a hierarchical tree. Each hierarchy includes one or more attributes and the corresponding values.

Consider a table named Posts that keeps information for all original posts and their reply posts in an online forum. An original post and its reply posts can form a hierarchical tree structure as shown in Fig. 1.
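One way to picture this data model is as a nested dictionary. The sketch below is only an illustration, not HierKV's actual storage format: the attribute name Body, the reserved key "text" for a reply's own text, the flat (post_id, parent_id, text) tuples, and both function names are assumptions introduced here.

```python
from typing import Dict, List, Optional, Tuple, Union

# A hierarchy maps attribute names to values; a value is either plain text
# or a nested hierarchy. The reserved key "text" (an assumption of this
# sketch) holds a reply's own text next to its nested replies.
Node = Dict[str, Union[str, "Node"]]

# Flat forum posts as (post_id, parent_id, text); parent_id None = original post.
Post = Tuple[int, Optional[int], str]


def nest_replies(posts: List[Post], parent_id: int) -> Node:
    """Collect the replies to parent_id as attributes Reply1, Reply2, ..."""
    node: Node = {}
    replies = [p for p in posts if p[1] == parent_id]
    for n, (post_id, _, text) in enumerate(replies, start=1):
        child = nest_replies(posts, post_id)
        # A leaf reply is stored as plain text, otherwise as text plus replies.
        node[f"Reply{n}"] = {"text": text, **child} if child else text
    return node


def build_record(topic: str, posts: List[Post]) -> Node:
    """Enclose an original post and all of its replies in one record value."""
    root_id, _, body = next(p for p in posts if p[1] is None)
    return {"Topic": topic, "Body": body, **nest_replies(posts, root_id)}


posts = [
    (1, None, "Is there anybody having experience in iphone?"),
    (2, 1, "I only have experience with Starhub, not bad."),
    (3, 2, "I heard Starhub is cheaper than SingTel, right?"),
    (4, 1, "Can iphone use the same number?"),
]
record = build_record("Any recommendation on iphone?", posts)
```

Because every reply lives inside the value of its parent, reading the one record returns the whole discussion, which is the property the model relies on.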
Any recommendation on iphone?
Is there anybody having experience in iphone? Which provider is better: SingTel, Starhub or M1?
    Re: Any recommendation on iphone?
    I only have experience with Starhub, not bad.
        Re: Re: Any recommendation on iphone?
        I heard Starhub is cheaper than SingTel, right?
    Re: Any recommendation on iphone?
    Can iphone use the same number?
        Re: Re: Any recommendation on iphone?
        Prepaid does not enjoy the portability from one SIM to another.
        Re: Re: Any recommendation on iphone?
        You'd better check with service provider on this portability issue.
            Re: Re: Re: Any recommendation on iphone?
            Which department is responsible for this?

Fig. 1: Hierarchical tree formed by correlated data

Fig. 2 illustrates how the group of posts in Fig. 1 is organized in a single data record in table Posts with the hierarchical model. The record is identified by the primary key post_. The value includes four hierarchies of information, representing the four levels of posts in the hierarchical tree. Each higher hierarchy resides in the value of the preceding lower hierarchy. In the example record, the first hierarchy contains attributes Topic, [...], Reply1, Reply2, etc. that correspond to the original post and its direct reply posts in the tree, followed by the second


hierarchy and so on. With this hierarchical model, all posts correlated by the same topic can be enclosed within a single data record. When data is retrieved by topic, all these posts can be retrieved quickly with a single disk read.

Primary key: post_
Value (1st to 4th hierarchy, nested by indentation):
    Topic: Any recommendation on iphone
    [...]: Is there anybody having experience in iphone? Which provider is better: SingTel, Starhub or M1?
    Reply1: I only have experience with Starhub, not bad.
        Reply1: I heard Starhub is cheaper than SingTel, right?
    Reply2: Can iphone use the same number?
        Reply1: Prepaid does not enjoy the portability from one SIM to another.
        Reply2: You'd better check with service provider on this portability issue.
            Reply1: Which department is responsible for this?

Fig. 2: Example data record with hierarchical model

HierKV provides rich APIs for data access and search. The basic APIs include get, put, and delete to read, write/update, and delete data by primary key. In addition, a search API is provided to search and retrieve data based on the secondary attributes. It defines a set of operations ($equal, $more, $less, $has, $phrase, etc.) to facilitate full-text keyword search by point search, range search, and phrase search. The search API provides two attributes, keyword and path, to define the query command. An example query command that searches for data records containing a keyword within attribute Topic can be expressed in JSON format as { "search": { "keyword": { "$has": "..." }, "path": { "$equal": "Topic" } } }, with the keyword in place of "...". The detailed search features are described in Part 3.

3 Data Retrieval for HierKV

Generally, data retrieval from key-value stores is based on primary keys only. If the application needs to retrieve data by other attributes, it takes a long time to scan the whole data collection and match the attributes.
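The cost of that full scan is easy to see in a small sketch. The code below is a hypothetical in-memory stand-in, not HierKV's implementation: the class ToyStore, the function scan_search, the keys post_1 and post_2, and the sample keyword are all assumptions made for illustration. It offers get, put, and delete by primary key and answers a $has/$equal query by visiting every record.

```python
from typing import Dict, Iterator, List, Tuple

Value = Dict[str, object]  # attribute -> text or nested hierarchy


class ToyStore:
    """Hypothetical in-memory stand-in for a key-value table."""

    def __init__(self) -> None:
        self._rows: Dict[str, Value] = {}

    def put(self, key: str, value: Value) -> None:
        self._rows[key] = value  # write or update by primary key

    def get(self, key: str) -> Value:
        return self._rows[key]  # read by primary key

    def delete(self, key: str) -> None:
        self._rows.pop(key, None)  # deleting a missing key is a no-op

    def scan(self) -> Iterator[Tuple[str, Value]]:
        return iter(self._rows.items())


def scan_search(store: ToyStore, query: dict) -> List[str]:
    """Answer a $has/$equal query by visiting every record (no index)."""
    keyword = query["search"]["keyword"]["$has"].lower()
    path = query["search"]["path"]["$equal"]
    hits = []
    for key, value in store.scan():
        text = value.get(path)
        # Only first-hierarchy text attributes are checked in this sketch.
        if isinstance(text, str) and keyword in text.lower().split():
            hits.append(key)
    return hits


store = ToyStore()
store.put("post_1", {"Topic": "Any recommendation on iphone",
                     "Reply1": "Can iphone use the same number"})
store.put("post_2", {"Topic": "Which provider is better",
                     "Reply1": "I only have experience with Starhub"})

query = {"search": {"keyword": {"$has": "iphone"},
                    "path": {"$equal": "Topic"}}}
print(scan_search(store, query))  # ['post_1']
```

The loop in scan_search touches every stored record for every query, so its cost grows with the size of the collection regardless of how few records match.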
Recently, some key-value stores [6-8] have appeared that support search by secondary attributes. However, neither [6] nor [7] indexes data by the secondary attributes, so they need to traverse the partitions to identify the potential search results. In HierKV, data records are indexed by the secondary attributes when they are first inserted. Search and data retrieval are accelerated by matching the query keywords against the index records instead of scanning the whole data collection.

HierKV uses a hierarchical structure to organize the correlated data, with different levels of information stored in different hierarchies. Query commands can use this hierarchy information to search for data records with given keywords inside specific attributes, so the index record must keep the hierarchy information to fulfill such queries. Keyword matches in different hierarchies have different degrees of relevance to the query. We propose a hierarchical TF-IDF ranking mechanism for HierKV that takes the hierarchy into account when ranking the search results.

3.1 Index structure

An inverted index is used to index the hierarchical data in HierKV to accelerate full-text keyword search. Each unique term in the value of data records is indexed. Index records are stored in dedicated index tables to keep the link between indexed terms and the original data records. Here we

use data record to denote the original data in the data tables, and index record to denote the index of the original data in the index tables.

The hierarchy information of each term in the data record is maintained as a relative path in the index record. Like the data record, the index record consists of a key and a value. The key of the index record is the stemmed version of the term. The value of the index record is organized in hierarchies, each with one or more attributes and the corresponding values. Besides the relative path indicating hierarchy information, other information needed for ranking (term frequency, document frequency, etc.) is also maintained in the index record.

Fig. 3 shows how the example data record in Fig. 2 is indexed under the key of one of its terms. The index record maintains the link between the actual term and all data records in table Posts that include it; the keys of these data records are stored in the value. In total, 28 data records in table Posts include the actual term. One of them is the example data record post_, where the actual term appears 4 times in different relative paths (hierarchies): at positions 4, 7, 2, and 1 in the relative paths Topic, [...], Reply2, and Reply2-Reply1, respectively. The relative path lists the attribute names of all enclosing hierarchies of the term and thereby indicates the term's hierarchy information in the data record. By keeping the hierarchy information in the index records, the system can support search within a specific hierarchy.

Fig. 3: Example index record for HierKV (fields: DF; for each data record key, TF and the relative path and position of each occurrence)

3.2 Search with hierarchical KVS

To accelerate search by the secondary attributes, a search tree can be constructed by sorting and linking all index records to each other. Each index record is a node in the tree, either a root node, an intermediate node, or a leaf node.
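The index record of Section 3.1 and the sorted lookup just described can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: a sorted list with binary search stands in for the search tree, the stemmer is a placeholder that only lowercases and strips punctuation, and the attribute name Body, the reserved key "text", and the key post_1 are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# index[term][record_key] -> list of (relative_path, position) occurrences
Index = Dict[str, Dict[str, List[Tuple[str, int]]]]


def walk(value: dict, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (relative_path, text) for every text in a nested record value."""
    for attr, val in value.items():
        if attr == "text":
            path = prefix  # a node's own text keeps the enclosing path
        elif prefix:
            path = f"{prefix}-{attr}"
        else:
            path = attr
        if isinstance(val, dict):
            yield from walk(val, path)
        else:
            yield path, val


def stem(word: str) -> str:
    """Placeholder stemmer: lowercase and strip punctuation only."""
    return word.lower().strip("?.,!:;")


def add_to_index(index: Index, key: str, value: dict) -> None:
    """Record path and 1-based position of every term of one data record."""
    for path, text in walk(value):
        for pos, word in enumerate(text.split(), start=1):
            index[stem(word)][key].append((path, pos))


def lookup(sorted_terms: List[str], index: Index, term: str) -> dict:
    """Binary search over sorted index keys, standing in for a search tree."""
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return index[term]
    return {}


index: Index = defaultdict(lambda: defaultdict(list))
add_to_index(index, "post_1", {
    "Topic": "Any recommendation on iphone?",
    "Body": "Is there anybody having experience in iphone?",
    "Reply2": {
        "text": "Can iphone use the same number?",
        "Reply1": "Prepaid does not enjoy the portability from one SIM to another.",
    },
})
sorted_terms = sorted(index)
print(lookup(sorted_terms, index, "iphone")["post_1"])
# [('Topic', 4), ('Body', 7), ('Reply2', 2)]
```

The term frequency of a term in a record is the length of its occurrence list, and the document frequency is the number of record keys stored under the term, so both ranking inputs can be read from the same index record.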
Typical search trees for HierKV are the Binary Tree [9], B Tree [10], B+ Tree [11], etc. A search starts from the root node and traverses the tree recursively towards the leaf nodes until the query keyword matches the key of a node (index record). Before searching, the query keywords are split into individual terms and each term is stemmed. These stemmed terms are then matched against the keys of the nodes (index records) while traversing the search tree. After finding the index record whose key matches the stemmed term, the value of the index record is checked to get the keys of the data records that include the actual term.

Besides the keyword, the query command can include a relative path that specifies which hierarchy the matched actual term should reside in. When checking the value of the index record, the relative path of the actual term needs to be matched against the path in the query. Take the query command { "search": { "keyword": { "$has": "..." }, "path": { "$equal": "Topic" } } } as an example. HierKV first finds the index record whose key matches the stemmed keyword. Then the value of this index record is examined. Only the keys of data records that include the actual term in the relative path Topic are extracted as search results.

HierKV supports phrase search by utilizing the position information of the actual terms kept

in the index records. Since the keywords appear contiguously in the phrase, the returned keys must belong to data records that contain the terms in adjacent positions in the same hierarchy. For example, a two-keyword phrase query whose second keyword is iphone requires returning the keys of data records that contain the two terms adjacently in the same hierarchy. HierKV matches both keywords against the index and checks the hierarchy and position information of the matched terms. Since the two terms appear in contiguous positions in hierarchies Topic, [...], and Reply2 of data record post_, its key is returned as one of the search results.

3.3 Hierarchical TF-IDF ranking mechanism

The ranking mechanism plays an important role in a search system: it decides how relevant the retrieved data are to the query. Many search systems use the TF-IDF ranking mechanism [12-13] to calculate scores for retrieved data records. In the literature, concerns have been raised about inappropriate term weighting in the final score, and [14-15] propose term weighting schemes to improve the TF-IDF score for document retrieval.

We propose a hierarchical TF-IDF ranking mechanism for HierKV to facilitate data retrieval with consideration of hierarchy. The score of each retrieved data record is determined by several parameters: term frequency, document frequency, hierarchy factor, etc. As keyword matches in different hierarchies have different degrees of relevance to the query, the TF of each term is weighted by a hierarchy factor to improve the accuracy of the TF-IDF score. In HierKV, Eq.
(1) defines the enhanced TF-IDF score $s_{t,d,k}$ for a match of term $t$ in data record $d$ at a specific hierarchy $k$:

$$s_{t,d,k} = h_{t,k} \cdot \text{tf-idf}_{t,d,k} = h_{t,k} \cdot tf_{t,d,k} \cdot \log\frac{N}{df_t} \qquad (1)$$

where $h_{t,k}$ is the hierarchy factor indicating that term $t$ occurs in the $k$-th hierarchy within data record $d$, $df_t$ is the document frequency, i.e. the total number of data records containing term $t$ in the data collection, $tf_{t,d,k}$ is the term frequency, i.e. the number of times term $t$ occurs in the $k$-th hierarchy of data record $d$, and $N$ is the total number of data records in the data collection. The hierarchy factor $h_{t,k}$ is a function that reflects the degree of relevance of the hierarchy to the query.

Considering that there may be multiple matches of term $t$ in data record $d$ and each match may have a different hierarchy factor $h_{t,k}$, the accumulated TF-IDF score $S_{t,d}$ for term $t$ in data record $d$ is defined in Eq. (2):

$$S_{t,d} = \sum_k s_{t,d,k} = \sum_k \left( h_{t,k} \cdot tf_{t,d,k} \right) \cdot \log\frac{N}{df_t} \qquad (2)$$

With the hierarchical TF-IDF ranking mechanism, the search results reflect relevance to the query more accurately than ranking with traditional TF-IDF scores. Fig. 4 shows the pseudo code of a search in HierKV, including both searching and ranking.

4 Experiments and Evaluation

We have conducted experiments to evaluate the performance of the proposed HierKV system. Two aspects are evaluated: searching and ranking. The first experiment aims to evaluate

Precision and Recall for the HierKV system when searching with keywords. Precision and Recall are expressed in Eq. (3) and (4), respectively.

$$Precision = \frac{|\{RelevantData\} \cap \{RetrievedData\}|}{|\{RetrievedData\}|} \qquad (3)$$

$$Recall = \frac{|\{RelevantData\} \cap \{RetrievedData\}|}{|\{RelevantData\}|} \qquad (4)$$

Q: query command including a list of keywords {K_1, K_2, ..., K_n}
R: query result including a list of data record keys
R = ∅;
for (each keyword K_i in Q) {
    stem K_i into stemmed term t_i;
    while (search the index tree) {
        match t_i with the key of each index record;
        if (t_i matched with the key of index record r_i) {
            check the value part of index record r_i;
            get keys of data records {d_i1, d_i2, ..., d_im} each containing the actual term k_i;
            for (each data record key d_ij) {
                get the value of term frequency tf and document frequency df;
                calculate the accumulated TF-IDF score: s_i,j = sum over k of (h_i,k * tf_i,j,k) * log(N / df_i);
            }
            merge {d_i1, d_i2, ..., d_im} with R by intersection;
        }
    }
}
for (each data record key d_j in R) {
    calculate the ranking score: s_j = sum over i of s_i,j;
}
rank the data record keys in R with descending order of ranking score s_j;
return the ranked result R;

Fig. 4: Search and rank of hierarchical KVS in HierKV

The test data come from the CACM collection, which includes 3204 records of articles from the Communications of the ACM. We sorted all records by size and chose the 100 largest records for testing. Each record contains several fields about the article (e.g. title, abstract, authors, etc.). We formatted each record into a two-hierarchy structure, and selected the five most common terms, computer, program, storage, structure, and system, as query keywords. All five query commands were issued to the search system with the hierarchical TF-IDF ranking mechanism. One of the commands is: { "search": { "keyword": { "$has": "computer" } } }. The accumulated TF-IDF score (Eq. (2)) is used to rank the search results.
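The search-and-rank procedure of Fig. 4, together with the Precision and Recall measures of Eq. (3) and (4), can be sketched as below. This is a minimal illustration rather than the authors' code. The hierarchy factor 1/level is an assumption chosen for the example, query terms are assumed to be stemmed already, the first keyword seeds the result set so that later intersections are not empty, and the toy index with keys d1 to d3 is made up.

```python
import math
from typing import Dict, List, Set, Tuple

# index[term] -> {record_key: {hierarchy_level: tf at that level}}
Index = Dict[str, Dict[str, Dict[int, int]]]


def hierarchy_factor(level: int) -> float:
    """Assumed weighting 1/level, chosen for illustration only."""
    return 1.0 / level


def term_score(index: Index, term: str, key: str, n_records: int) -> float:
    """Accumulated score S_{t,d} of Eq. (2)."""
    postings = index[term]
    idf = math.log(n_records / len(postings))  # df_t = records containing t
    weighted_tf = sum(hierarchy_factor(k) * tf for k, tf in postings[key].items())
    return weighted_tf * idf


def search(index: Index, terms: List[str], n_records: int) -> List[Tuple[str, float]]:
    """Intersect the postings of all terms, then rank by summed scores."""
    result: Set[str] = set()
    for i, term in enumerate(terms):
        keys = set(index.get(term, {}))
        # The first term seeds the result; later terms narrow it.
        result = keys if i == 0 else result & keys
    scored = [(key, sum(term_score(index, t, key, n_records) for t in terms))
              for key in result]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Eq. (3) and Eq. (4); both sets are assumed non-empty."""
    overlap = len(retrieved & relevant)
    return overlap / len(retrieved), overlap / len(relevant)


# Toy collection of N = 4 records; "computer" occurs in three of them.
index: Index = {
    "computer": {"d1": {1: 1}, "d2": {2: 3}, "d3": {1: 2, 2: 1}},
    "system": {"d1": {1: 1}, "d3": {2: 2}},
}
ranked = search(index, ["computer"], n_records=4)
print([key for key, _ in ranked])  # ['d3', 'd2', 'd1']
```

In this toy data d2 and d3 both contain the keyword three times, so plain TF-IDF would score them equally; the assumed factor separates them because two of the matches in d3 sit in the first hierarchy.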
Here the hierarchy factor for hierarchy i is defined as: h i = i 1, which reflects that higher hierarchies have higher degrees of relevance to the query. After searching and ranking, five query results are returned by the HierKV system. We verified the query results against the test data and found that the retrieved data set is exactly the same as the relevant data set for every query command. This means the HierKV system achieves the highest value, 1, for both Precision and Recall for all queries.

Another experiment evaluates the ranking efficiency of the proposed hierarchical TF-IDF ranking mechanism. It uses 16 test data records, each with two hierarchies. The term computer appears in either the first or the second hierarchy with a different number of occurrences. Two query commands with the keyword computer are issued to the HierKV system, while the


proposed hierarchical TF-IDF ranking mechanism and the normal TF-IDF ranking mechanism are used (with Hierarchy set to 1 and 0), respectively. The accumulated TF-IDF score (Eq. (2)) is calculated with the hierarchy factor for hierarchy i defined as: h i = i 1.

The results returned by the HierKV system for the two query commands show that the sequences of the returned keys differ between the two queries. The hierarchical TF-IDF ranking mechanism gives higher rank to the data with more keyword matches in the higher hierarchy, which reflects that different hierarchies have different degrees of relevance to the query. The normal TF-IDF ranking mechanism ranks data only according to the keyword occurrences, without reflecting the different degrees of relevance of different hierarchies to the query.

5 Conclusions

In this paper, we have proposed a searchable key-value store, HierKV, to provide rich search features and fast data retrieval. The correlated data are grouped and organized in a hierarchical data object in HierKV to speed up data retrieval by attributes. HierKV indexes data objects

by the secondary attributes and keeps the hierarchy information in the index to facilitate rich search features. A hierarchical TF-IDF ranking mechanism has been proposed for HierKV that applies hierarchy factors to the TF-IDF scores, so that they properly reflect the different degrees of relevance of keyword matches in different hierarchies. Experiments have been conducted to test the HierKV search system and evaluate the hierarchical TF-IDF ranking mechanism. According to the experimental results, the proposed HierKV system achieves good performance, with Precision and Recall of 1 on all tested queries. Compared to the normal TF-IDF ranking mechanism, our hierarchical TF-IDF ranking mechanism properly reflects that different hierarchies have different degrees of relevance to the query.

References

[1] F. Chang, et al. BigTable: A Distributed Storage System for Structured Data. In Proc. of OSDI, Nov. 2006.
[2] G. DeCandia, et al. Dynamo: Amazon's Highly Available Key-Value Store. In Proc. of SOSP, Oct. 2007.
[3] A. Lakshman and P. Malik. Cassandra: A Decentralized Structured Storage System. SIGOPS Oper. Syst. Rev., Vol. 44, No. 2, Apr. 2010.
[4] Project Voldemort.
[5] Apache HBase.
[6] R. Escriva, B. Wong, and E. G. Sirer. HyperDex: A Distributed, Searchable Key-Value Store. In Proc. of ACM SIGCOMM, Aug. 2012.
[7] M. T. Najaran and N. C. Hutchinson. Innesto: A Searchable Key/Value Store for Highly Dimensional Data. In Proc. of IEEE CloudCom, Dec. 2013.
[8] Bin Liang, Yiqun Liu, Min Zhang, Shaoping Ma, Liyun Ru, Kuo Zhang. THUIR-DB: A Large-scale, Highly-efficient Index, Fast-access Key-value Store. Journal of Computational Information Systems 9: 6 (2013).
[9] Rowan Garnier and John Taylor. Discrete Mathematics: Proofs, Structures and Applications. Third Edition, CRC Press.
[10] Douglas Comer. The Ubiquitous B-Tree. ACM Computing Surveys, Vol. 11, Issue 2, June 1979.
[11] Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems. 6th Edition, Addison-Wesley.
[12] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.
[13] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18 (11), Nov.
[14] Jiaul H. Paik. A Novel TF-IDF Weighting Scheme for Effective Ranking. In Proc. of ACM SIGIR, July 2013.
[15] Ho Chung Wu, et al. Interpreting TF-IDF Term Weights as Making Relevance Decisions. ACM Transactions on Information Systems, Vol. 26, No. 3, Article 13, June 2008.
