A HBase Secondary Index Method Based on Isomorphic Column Family


Yudong Guo 1, Shenglin Li 1, Heng Zhang 2, Wei Zhong 1*

1 Military Logistics & Information Engineering Department, Logistical Engineering University, Chongqing 401331, China; 2 Hiorigin Technology Co., Ltd., Chongqing 400000, China

Abstract

The HBase database supports only the rowkey index, so an index cannot be built on a non-rowkey field. To solve this problem, this paper proposes a general HBase secondary index method based on an isomorphic column family (ICF-HBase), which constructs a key-value index in an index column family of the original data table and, on this basis, realizes single-key indexes, multi-key indexes, data import and an optimized query mechanism. Performance experiments on data writing, data volume and data query were designed on large-scale datasets. The experimental results show that, compared with the HUAWEI Hindex, the ICF-HBase index method is more efficient in data writing and query: its insertion speed is faster than Hindex, and its index space is only about 60% of that of Hindex. The ICF-HBase index method is therefore of practical significance for large-scale data query on HBase.

Keywords: HBase, secondary index, isomorphic column family, big data query

1. INTRODUCTION

With the rapid development of the Internet and information technology, the scale of data in many fields has exploded. Data is no longer small, static and isolated, but large-scale, dynamic and distributed, and retrieving useful information from massive data is a challenging task (Janarthanan et al., 2015). Because of the huge data volume and the complexity of data types, traditional database technology struggles to meet the demand for efficient storage and management, and a single-server environment cannot bear the cost of building large-scale data indexes for a search engine. The arrival of the big-data era has changed the traditional way data is stored and retrieved. HBase, a representative non-relational database with high performance, scalability, distributed column-oriented storage and real-time read/write capability, can handle large data volumes and thus makes up for the deficiencies of traditional databases. However, HBase still has limitations: it supports only rowkey-based queries, lacks support for multi-dimensional complex queries and does not support cross-row transactions. Therefore, much scientific and practical work focuses on improving the performance of multi-dimensional data query in HBase. Many studies improve HBase retrieval performance by optimizing the schema structure and the rowkey design, but these methods also introduce additional problems, such as disturbing the original region load balancing and breaking the consistency between data and index. Against this background, this paper compares and analyzes related research on HBase secondary indexes and, based on the standard "table - column family - column - value" HBase data model, proposes an HBase secondary index method based on an isomorphic column family (ICF-HBase). The design strategy is to create one or more index column families in the data table and to store the index columns and their description information as key-value pairs in those column families, keeping the index isomorphic with the original data.
Based on this structure, index-maintenance algorithms for the data writing and query paths are designed. According to commonly used query operations, three cases are optimized: single non-rowkey value query, multi-column value query and range query. The experimental results show that the ICF-HBase method proposed in this paper is beneficial for the dynamic construction of indexes and for the consistency between data and index.

In this paper, the basic theory, methods and algorithms of HBase secondary indexing are studied. The main work and contributions are as follows. The current mainstream HBase index techniques are analyzed and their advantages and disadvantages are compared. An isomorphic column family HBase index method (ICF-HBase) is proposed, which establishes a single index column family in the original data table. A key-value pattern for the index columns is designed, on top of which the data writing and query methods are implemented. For practical applications the query strategy is further optimized, covering multi-key indexes, query expression conversion and multi-interval scanning. A large-scale dataset index experiment shows that the ICF-HBase index method proposed in this paper is more effective than the HUAWEI Hindex in index construction and data query, and the advantage in query efficiency grows as the data size increases.

2. CORRELATIONAL RESEARCH

HBase, the mainstream NoSQL database, stores and retrieves data in a way similar to a hash table and provides basic data manipulation operations such as Put, Get and Scan. To write data, the HBase client packages the submitted data into a Put object and sends it to the server (HBase, 2016). Data query is achieved with the Get and Scan operations, in three main ways: querying a single specified rowkey, querying a specified range of rowkeys, and a full table scan. At present HBase does not support queries on non-key attributes; they can only be answered by a full table scan, which is far more costly than a rowkey-based query with specified conditions. The time complexity of rowkey-based retrieval is O(log N), and with a BloomFilter it can reach O(1), whereas a full table scan is O(N). For applications with record counts ranging from millions to more than a hundred million, the time cost of scanning the whole table for a non-primary-key condition is unacceptable, which makes it hard for HBase to meet the real-time query demands of many practical applications.
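
As an illustration of the full-table-scan fallback described above, the following minimal sketch (not from the paper) evaluates a condition on a non-rowkey column with a server-side filter; it assumes an HBase 2.x Java client and a hypothetical table data_table with column family cf1.

```java
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FullScanFallback {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            // Without a secondary index, a condition on a non-rowkey column can only be
            // evaluated by scanning every row and filtering server-side: O(N) in the table size.
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("cf1"), Bytes.toBytes("col2"),
                    CompareOperator.EQUAL, Bytes.toBytes("v21")));

            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```
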
Since HBase itself supports only the rowkey index, a whole row record can only be located through its rowkey. Table 1 lists the current major solutions.

Table 1 Comparison of the main solutions for HBase multi-dimensional query
1. Multiple fields used as rowkey, data written multiple times: a large amount of storage is consumed when there are many index columns.
2. Index fields combined into the rowkey: the design is complex, inflexible and poorly scalable.
3. Parallel scan with filters, without an index: consumes a lot of computing resources with poor effectiveness.

In order to improve the performance of HBase non-primary-key queries, some research work is devoted to the construction of HBase secondary indexes. For selection queries on non-primary attributes over large data (including point queries and range queries), HUAWEI developed and open-sourced the non-primary-key query system Hindex (Huawei, 2016). Instead of a single global index table, it adopts a local index model, building a separate index table for each region of the HBase table to speed up non-primary-key queries. Query requests are sent to every region server, filtered through the index table of each region, and the results are returned to the result set. Hindex queries therefore need to access all regions; when the data is concentrated in one region, this consumes a lot of search resources. Moreover, the data and the index are stored in two tables, so the data may become inconsistent with the index when data is written. In addition, region split and merge operations must recompute the split points of the original data to ensure that data and index are split into the same region. Building a separate index table therefore brings a series of problems. Lily HBase Indexer is a part of the Lily system developed by NGDATA. It uses SolrCloud to store HBase index data; when an HBase write, update or delete operation occurs, the Indexer abstracts the operation into a series of events

and, through the HBase replication mechanism, ensures the consistency of the index data written to Solr (Ngdata, 2016). The Indexer supports user-defined extraction and conversion rules for indexing HBase column data, and the Solr search results contain the user-defined column family:qualifier fields, so that the application can access the HBase column data directly. Furthermore, indexing and searching do not affect the stability or write throughput of HBase, because the index and search processes are completely separate and asynchronous. However, this indexing mechanism updates the index periodically, so its timeliness is low and it sometimes cannot satisfy real-time applications. Zhang et al. proposed HBaseSpatial, a scalable spatial data storage mechanism based on HBase; compared with MongoDB and MySQL, their experiments show that it effectively improves spatial data query speed and provides a good storage solution, but the method is not compared with other distributed index methods (Zhang et al., 2014). GE Wei et al. proposed a layered non-primary-key index storage model based on HBase and designed and implemented the hierarchical index query system HiBase (Ge et al., 2016). HiBase stores the user table and the index table on HBase and caches the hot data of the index table in memory. A memory-based index model can greatly improve the efficiency of HBase indexing, but the cache management mechanism for large-scale data is complicated and consumes a lot of memory. ZHANG Chong et al. approach indexing from the time dimension, design an index structure named HST using metadata, and propose range query and kNN query algorithms together with corresponding parallel algorithms; however, no general query strategy is given for keyword queries and range queries on dimensions other than time (Zhang et al., 2016). A research team of the Chinese Academy of Sciences proposed CCIndex, a multi-dimensional range query scheme on non-key attributes, which is deployed on distributed ordered tables to provide a high-performance, low-cost spatial index; prototype systems were implemented on HBase and Cassandra respectively (Zou et al., 2010). The main idea is to make full use of the multiple replicas of the data: a clustering index on a different non-key attribute is built in each replica, so that non-primary-key queries are turned from large random reads into sequential scans of the index based on the primary key table. CCIndex achieves a significant performance improvement for multi-dimensional non-primary-key range queries while optimizing the space overhead of the index. ZHANG Yu et al. proposed SK-HBase, a spatial keyword index structure for spatio-textual data that uses HBase as the data store and, through an effective data distribution strategy, indexes the spatial information and the textual information of spatio-textual objects at the same time. On the basis of SK-HBase, two spatial keyword query algorithms are proposed to guarantee the efficiency and scalability of spatial keyword queries over different spatial ranges. The experimental results show that the method queries spatial data efficiently and scales well (Zhang et al., 2012).

3. INTRODUCTION TO HBASE INDEX

3.1 HBase data model

A table in the HBase database stores a series of rows, and each row is composed of three basic elements: the Rowkey, the Timestamp and the Column.
The Rowkey is equivalent to the primary key in a relational database; it identifies different row records, and the system automatically sorts the rows in lexicographic order of the rowkey (George, 2011). The Timestamp labels the different versions of a record, distinguishing the data written by each operation. A Column is defined in the form <family>:<qualifier> and consists of two parts, the Family and the Qualifier. The Family, usually called a column family, is the unit of physical storage; frequently co-accessed data is generally placed in the same column family to improve access efficiency, and the number of column families is defined when the table is created. The Qualifier identifies a column within a column family; its number is unlimited and can grow dynamically. In a traditional relational database the storage location of a value is determined by the column name alone, while in HBase both the Family and the Qualifier must be known to determine the column in which data is stored. To construct an HBase index, four pieces of information must be included: the rowkey, the column family name, the column qualifier and the timestamp.
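
A minimal sketch of this data model in code, assuming an HBase Java client and a hypothetical table data_table with column family cf1: a cell is addressed by rowkey, <family>:<qualifier> and timestamp.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster described by hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            // A cell is addressed by rowkey + <family>:<qualifier> + timestamp;
            // here the timestamp is assigned by the server since it is not set explicitly.
            Put put = new Put(Bytes.toBytes("RK1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("v11"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col2"), Bytes.toBytes("v21"));
            table.put(put);

            // Reading the row back: only the rowkey is indexed natively.
            Get get = new Get(Bytes.toBytes("RK1"));
            get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col2"));
            Result r = table.get(get);
            System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col2"))));
        }
    }
}
```
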

Only the combination of rowkey, <Family>:<Qualifier> and timestamp determines a data cell in a table. Data in HBase has no type and is stored in binary format. Each row has a sortable primary key and an arbitrary number of columns, and a row may have no value in some columns. All rows share the same column families but may contain different columns. HBase therefore differs greatly from a traditional database: it is column-oriented, there is no relationship between tables, each table is independent, and there are no foreign key constraints (Taylor, 2010).

3.2 HBase index

Establishing a column index in HBase is straightforward, because HBase itself already provides the index from the rowkey to the whole row; the mapping we need to add goes from the value of a column back to the rowkey. The core idea of index construction is to establish exactly this mapping. As shown in Figure 1, a data table has one column family (cf1) with two columns, cf1:col1 and cf1:col2. An index is first built on cf1:col2; for a query with a condition on cf1:col2, the matching rowkeys are looked up in the index table, and those rowkeys are finally used to retrieve the complete records from the original table.

Figure 1 Schematic diagram of a column index.
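
The mapping of Figure 1 can be sketched with a plain HBase client as follows. This is only an illustration of the value-to-rowkey idea using a separate index table; the table and column names, the '_' separator and the key layout are assumptions, and the ICF-HBase method described next stores the index in a column family of the data table instead.

```java
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnIndexSketch {
    // Maintain the value-to-rowkey mapping of Figure 1 in a separate index table:
    // the index rowkey is the indexed value plus the original rowkey,
    // and the cell stores the original rowkey.
    static void indexPut(Table dataTable, Table indexTable,
                         String rowkey, String col2Value) throws Exception {
        Put data = new Put(Bytes.toBytes(rowkey));
        data.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col2"), Bytes.toBytes(col2Value));
        dataTable.put(data);

        Put idx = new Put(Bytes.toBytes(col2Value + "_" + rowkey)); // value + rowkey keeps index keys unique
        idx.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("rk"), Bytes.toBytes(rowkey));
        indexTable.put(idx);
    }

    // Query by column value: scan the index table by prefix, then Get the original rows.
    static void queryByValue(Table dataTable, Table indexTable, String value) throws Exception {
        Scan scan = new Scan().withStartRow(Bytes.toBytes(value + "_"))
                              .withStopRow(Bytes.toBytes(value + "`")); // '`' directly follows '_' in ASCII
        try (ResultScanner rs = indexTable.getScanner(scan)) {
            for (Result r : rs) {
                byte[] rk = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("rk"));
                Result row = dataTable.get(new Get(rk));
                System.out.println(Bytes.toStringBinary(row.getRow()));
            }
        }
    }
}
```
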

4. ISOMORPHIC COLUMN FAMILY INDEX METHOD

4.1 Overall design

Figure 2 displays the overall client/server architecture of the HBase secondary index. In HBase, the region is the basic unit of data storage and management. A table can contain one or more regions, each region is served by exactly one RegionServer (RS), and an RS can serve multiple regions at the same time; the regions hosted on different RSs together form the overall logical view of a table.

Figure 2 Distributed, concurrent data indexing and query process.

(1) Client query process
When querying data, the client sends the request to every region of the table; each region performs the retrieval on its RegionServer and returns its partial results to the client, which then merges and sorts the results obtained from the multiple RegionServers to produce the final query result.
(2) Index building process
When building an index, the client sends the data write request to one region; according to the index description, that region builds the index on its RegionServer and stores it there, and the result of the operation is returned to the client.
Whether querying or indexing, the process is distributed and concurrent.

4.2 Index design

To explain the index design clearly, we continue the example of the previous section, in which the data of a table is distributed over two regions. When new data is written to the database, the system detects that an index has been defined on column col2 and writes the corresponding index data into the cf_idx column family. As shown in Figure 3, a new column family named cf_idx is added to the original data table to store the index information; each region then stores the index of its own original data inside the region itself, using a single reserved column named col_idx, which users are not allowed to use for their own data.

Figure 3 Schematic diagram of the HBase secondary index (index definition IDX_name => cf1:col2, with index keys such as 1.StartKey+IDX_name+v21+RK1 stored in the cf_idx family).

(1) Generating rules for index key

The index key is what keeps the data and the index isomorphic. To make non-primary-key queries as efficient as possible, this paper proposes the following key generation rule for ICF-HBase. The serialized index key is

key = N.StartKey + IDX_name + value + rowkey    (1)

where N.StartKey is the start key of the region holding the data, IDX_name is an extensible set of indexed columns of the form cf_m:col_n, and value is the corresponding extensible string of column values v_ij (i = 1..m, j = 1..n). For an indexed column, the index is logically a {key: value} pair, and the key consists of four parts:
Region information. To ensure that the index and the original data are stored in the same interval (the same region), the key must be prefixed with the region startkey;
Index information. The index name IDX_name is appended to identify which column index the entry belongs to;

Isomorphic column value information. The column value corresponding to the index name is appended after the index name, identifying the indexed column and its value;
Isomorphic rowkey information. The rowkey is stored after the column value, identifying the row to which the column value belongs.
So that the information stored in the index key can be parsed back, the key is also serialized into the cell value of the index entry. After this organization, any query on an indexed column can be transformed into a scan restricted to the cf_idx column family; since startrow and endrow can be specified, the cost of the query is greatly reduced.

(2) Index construction method

According to the complexity of the query, indexes are built in two ways: single-key indexes and multi-key indexes. Figures 4(a) and 4(b) respectively show how a single-key index and a multi-key index are built.

Figure 4 Construction of a single-key index and a multi-key index.

According to equation (1), a combined multi-key index key is obtained from the single-key index by appending the additional column names to IDX_name and the additional column values to value (for example StartKey + IDX_name + v11 + v21 + RK1). Each indexed column must be mapped to a data type; the paper covers the most common data types and query methods, as shown in Table 2. Text-type data uses fuzzy query (Kavitha et al., 2014).

Table 2 Index types and query methods
Index type   Length      Query methods
Char         2           equivalence, range query
Byte         1           equivalence, range query
Short        2           equivalence, range query
Int          4           equivalence, range query
Long         8           equivalence, range query
Float        4           equivalence, range query
Double       8           equivalence, range query
String       Unlimited   exact match, range query
Text         Unlimited   fuzzy query
Multi index  Unlimited   index optimization
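
The serialization of equation (1) can be sketched as below. The byte layout (no separators, plain string concatenation) is an assumption, since the paper only fixes the concatenation order of StartKey, IDX_name, value and rowkey.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class IndexKeyBuilder {
    /**
     * Serializes an ICF-HBase index key following equation (1):
     *   key = region startKey + IDX_name + value + rowkey
     */
    static byte[] buildIndexKey(byte[] regionStartKey, String idxName,
                                byte[] columnValue, byte[] rowkey) {
        byte[] key = regionStartKey;
        key = Bytes.add(key, Bytes.toBytes(idxName));
        key = Bytes.add(key, columnValue);
        key = Bytes.add(key, rowkey);
        return key;
    }

    public static void main(String[] args) {
        // Example from Figure 3: index IDX_name on cf1:col2, value v21, row RK1,
        // with "1." standing in for the start key of region 1 (an assumption).
        byte[] key = buildIndexKey(Bytes.toBytes("1."), "IDX_name",
                                   Bytes.toBytes("v21"), Bytes.toBytes("RK1"));
        System.out.println(Bytes.toStringBinary(key)); // 1.IDX_namev21RK1
    }
}
```
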

4.3 Data import

After the index is constructed, the data write path changes slightly. The basic process is as follows: when the client submits a Put object, the system first determines which columns the Put contains, then determines from the index description which of those columns are indexed, constructs the corresponding index data as additional Put entries, and finally submits them together with the original Put. This design simplifies data import: because the data and its index are submitted in the same batch, they either all succeed or all fail, so no inconsistency between data and index can appear. The data import process is shown in Figure 5.

Figure 5 Data import process.

Algorithm 1 describes in detail how, at data write time, an index entry is built according to the key generation rule, added to the list of Put objects, and finally submitted together with the data. The algorithm is as follows:

Algorithm 1 ICF-HBase data write and index construction algorithm
Input: client Put object P(RK, C, V), where RK is the rowkey, C is the set of specified columns in the column family, C = {cf:col_1, ..., cf:col_i, ..., cf:col_n}, and V is the set of values of the columns in C, V = {value_1, ..., value_i, ..., value_n}
Output: transaction operation result TransResult
1.  IndexInfo <- loadIndexInfo(C)
2.  ZRange <- calculateRange(C, IndexInfo)
3.  for info_i in IndexInfo do
4.      key <- ZRange.startKey + info_i.name + cf:col_i + value_i
5.      value <- serializeIndexInfo(key)
6.      {cf_idx: col_idx} <- generateIdxOne(key, value)
7.      PutList.add({cf_idx: col_idx})
8.  end for
9.  postToMemStore(PutList)
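
The idea behind Algorithm 1 can be approximated on the client side as follows: the data Put and the corresponding index Put (written to the cf_idx family of the same table, with a rowkey built by equation (1)) are submitted in one batch. In the paper this logic runs inside the RegionServer and is posted directly to the MemStore; the table name, column names and region start key used here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class IcfWriteSketch {
    public static void main(String[] args) throws Exception {
        byte[] CF = Bytes.toBytes("cf1");
        byte[] CF_IDX = Bytes.toBytes("cf_idx");
        byte[] COL_IDX = Bytes.toBytes("col_idx");

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            List<Put> batch = new ArrayList<>();

            // Original data row.
            Put data = new Put(Bytes.toBytes("RK1"));
            data.addColumn(CF, Bytes.toBytes("col2"), Bytes.toBytes("v21"));
            batch.add(data);

            // Index row in the same table, written to the cf_idx family.
            // Its rowkey follows equation (1); the cell value holds the original rowkey (simplified).
            byte[] idxKey = Bytes.add(Bytes.toBytes("1."),                 // region startKey (assumed)
                            Bytes.add(Bytes.toBytes("IDX_name"),
                            Bytes.add(Bytes.toBytes("v21"), Bytes.toBytes("RK1"))));
            Put index = new Put(idxKey);
            index.addColumn(CF_IDX, COL_IDX, Bytes.toBytes("RK1"));
            batch.add(index);

            // Data and index are submitted together, mirroring the PutList of Algorithm 1.
            // Note: a client-side batch is not atomic across rows; the paper does this server-side.
            table.put(batch);
        }
    }
}
```
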

4.4 Data query

The data read path is slightly more complex than the write path and requires three steps. The first step is to create a scanner and specify a condition for reading data; to this end, this paper extends HBase's own Scan object so that it supports set operations over conditions. When a region server receives a client query request, it builds a search tree according to the condition: the logical AND, OR and NOT relations in the condition are each converted into a node of the search tree, and the whole tree is evaluated over the index column family. In the second step, each leaf node queries its data range through the HBase StoreScanner with the designated startrow and endrow, the partial results are merged through the logical operations in the search tree, and the rowkeys of the satisfying result set are obtained; to improve efficiency, a batch size can be set for batch operation. In the third step, a seek operation is performed in the original data column family according to the rowkey result set, the rows corresponding to those rowkeys are read, and the dataset is returned to the client. The data query process is shown in Figure 6.

Figure 6 Data query process.

Algorithm 2 describes the query process in detail: the search tree is constructed, the positions of the seek operations are determined, the rowkey list is retrieved from the index, and finally the data is fetched from the original data family. The algorithm is as follows:

Algorithm 2 ICF-HBase parallel index data query algorithm
Input: the client query field list colList and the query condition object Condition, where colList = {cf:col_1, ..., cf:col_i, ..., cf:col_n}, Condition = {con_1, ..., con_i, ..., con_n, Relation}, con_i = cf:col_i : flag_i : value_i with flag_i in {=, ≠, <, ≤, >, ≥} representing the comparison between a column and a column value, and Relation = {rel_12, ..., rel_ij, ..., rel_mn} representing the logical relationship between the conditions
Output: query data result set ResultSet
1.  Scanner scanner = new Scanner()
2.  conList <- getConList(Condition)
3.  for con_i = cf:col_i : flag_i : value_i in conList do
4.      position <- getPosition(con_i)
5.      IDXList_i = List[IDX_key, IDX_value] <- scanner.seek(position)
6.  end for
7.  for rel_ij in Relation do
8.      IDXList_ij <- calculateRelationIDXList(rel_ij)
9.      IDXList_all.add(IDXList_ij)
10. end for
11. rowkeyList <- generateRowkeyList(IDXList_all)
12. ResultSet <- scanner.seek(rowkeyList)
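
A client-side approximation of the query path of Algorithm 2 for a single equality condition (col2 = v21): the condition becomes a bounded scan over the cf_idx family, the rowkeys are recovered from the index entries, and the original rows are then fetched. The paper performs these steps inside the region server with StoreScanners and merges several condition subtrees; the key layout, table name and stop-row construction here are assumptions, and the rowkey is read from the index cell value as in the write sketch above (simplified; the paper serializes the full key there and parses the rowkey out).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class IcfQuerySketch {
    public static void main(String[] args) throws Exception {
        byte[] CF_IDX = Bytes.toBytes("cf_idx");
        byte[] COL_IDX = Bytes.toBytes("col_idx");

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            // Equality condition col2 = v21 becomes a bounded scan over the index family:
            // startrow = startKey + IDX_name + v21, stoprow = the same prefix plus a high byte
            // (a simplification; any key with this prefix falls inside the range).
            byte[] prefix = Bytes.add(Bytes.toBytes("1."),
                            Bytes.add(Bytes.toBytes("IDX_name"), Bytes.toBytes("v21")));
            Scan idxScan = new Scan()
                    .withStartRow(prefix)
                    .withStopRow(Bytes.add(prefix, new byte[]{(byte) 0xFF}))
                    .addFamily(CF_IDX);

            List<Get> gets = new ArrayList<>();
            try (ResultScanner rs = table.getScanner(idxScan)) {
                for (Result r : rs) {
                    byte[] rowkey = r.getValue(CF_IDX, COL_IDX); // original rowkey stored in the index cell
                    gets.add(new Get(rowkey).addFamily(Bytes.toBytes("cf1")));
                }
            }
            // Fetch the matching rows from the data family.
            for (Result row : table.get(gets)) {
                System.out.println(Bytes.toStringBinary(row.getRow()));
            }
        }
    }
}
```
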

4.5 Query optimization strategy

(1) Establish a composite index. Many enterprises have built decision support systems (Mansour, 2014), often modelled with UML during system design (Wang et al., 2016), and the business requirements usually involve queries over a logical combination of columns, for example mining the relationship between large-scale user data and various business processes (Wei et al., 2015). Establishing a composite index over two such columns can improve query efficiency. Time is a special dimension: many business-level queries are restricted to a certain time range, so by default both a single index on time and a composite index containing time are created. When the user queries on a column together with a specified time range, the single and composite time indexes allow all matching rowkeys to be retrieved with only one scan operation, which greatly improves efficiency.

(2) Transformation of query expressions. As described in the previous section, the logical AND, OR and NOT relations are the important nodes of the search tree. Among them, logical AND is often the most expensive: if an AND is applied to two large datasets, fully traversing even one of them costs a lot of time and memory, whereas OR and NOT operations tend to be more efficient. Therefore, this paper converts the client's query expression so that logical AND operations are handled first and the OR and NOT operations are performed afterwards; the AND operations can then be answered optimally through a composite index built over the columns involved. For example, the query expression (A || B) && (C || D) is rewritten as (A && C) || (A && D) || (B && C) || (B && D), as shown in the sketch after this section.

(3) Multi-interval scan query. Many business scenarios need multi-range queries such as a_1 < A < a_2 && b_1 < B < b_2. The traditional approach is to obtain the rowkey set RK_List1 of column A in the interval (a_1, a_2) and the rowkey set RK_List2 of column B in the interval (b_1, b_2), and then take the intersection of RK_List1 and RK_List2. This brings problems: if either set is too large, the query takes very long, and the in-memory intersection risks memory overflow. Therefore, this paper builds a joint index over A and B to support multi-range queries, and a differential optimization algorithm is used to compute the globally optimal solution set and optimize the seek operations (Vathi and Raju, 2015). The optimized seek process is shown in Figure 7.

Figure 7 Optimizing the seek process.
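
The rewrite in item (2) can be sketched as a simple distribution of conjunctions over the disjuncts; conditions are represented as strings purely for illustration, and each resulting conjunction would then be answered by one seek on the corresponding composite index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConditionRewrite {
    // Distributes AND over OR: (A || B) && (C || D)
    // becomes (A && C) || (A && D) || (B && C) || (B && D).
    static List<List<String>> distribute(List<String> left, List<String> right) {
        List<List<String>> conjunctions = new ArrayList<>();
        for (String a : left) {
            for (String b : right) {
                conjunctions.add(Arrays.asList(a, b));
            }
        }
        return conjunctions;
    }

    public static void main(String[] args) {
        System.out.println(distribute(Arrays.asList("A", "B"), Arrays.asList("C", "D")));
        // prints [[A, C], [A, D], [B, C], [B, D]]
    }
}
```
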

5. EXPERIMENTAL ANALYSIS

This section verifies the efficiency of the isomorphic column family HBase secondary index method through a series of experiments and compares the structure proposed in this paper with the HUAWEI Hindex, demonstrating the efficiency and scalability of the ICF-HBase index.

5.1 Experimental environment and dataset

To test the performance of the proposed ICF-HBase, the experiments were run on a Hadoop cluster of 10 nodes (1 master and 9 slaves). The node configuration of the cluster is shown in Table 3.

Table 3 Computer node configuration information
Name               Configuration
CPU                Intel(R) Core(TM) i7-4510U CPU 2.00 GHz * 4
Memory             8 GB
Disk               1 TB 7200 RPM SATA II
OS                 UbuntuKylin Linux
JVM Version        Java
Hadoop Version     Hadoop
HBase Version      HBase
ZooKeeper Version  ZooKeeper

The test dataset of Brown University and some real business data were used; to facilitate quantitative analysis, the data is divided into 10 parts, as shown in Table 4.

Table 4 Basic information of the datasets
Dataset       Record number   Size (MB)
Dataset1      100,000
Dataset2      200,000
Dataset3      300,000
Dataset4      400,000
Dataset5      500,000
Dataset6      1,000,000
Dataset7      1,500,000
Dataset8      2,000,000
Dataset9      2,500,000
Real Dataset  100,000,000     10,

5.2 Experimental results and analysis

(1) Performance comparison and result analysis of data import. 10,000,000 business records were used for this test; the objects under test were the HUAWEI Hindex and the improved ICF-HBase index method of this paper, and the number of records written was sampled every 5 seconds. The experimental results are shown in Figure 8.

Figure 8 Comparison of data writing efficiency.

The results show that, compared with the HUAWEI Hindex, the ICF-HBase index method proposed in this paper writes data and builds the index more efficiently: on average Hindex writes 4835 records every 5 seconds, while the ICF-HBase index method writes more records in the same interval. This is because during writes the original data submitted by ICF-HBase and its index data remain isomorphic and are stored in the same table, whereas Hindex splits data and index into two parts, which increases the amount of computation and transaction work.

(2) The effect of data volume on data writing and index construction. In this experiment, the insertion rate and index space of ICF-HBase were measured under different data volumes and compared with standard HBase and the HUAWEI Hindex; the results are presented in Figure 9. Figure 9(a) shows that, across the different data volumes, the insertion speed of the ICF-HBase index is 30%-40% faster than Hindex; moreover, as the data volume grows, the insertion speed of Hindex drops noticeably while that of ICF-HBase remains stable. Both are lower than the write speed of standard HBase without an index. This is because in ICF-HBase the original data and the index data are packaged into the Put objects of the same submission, while in Hindex the index data must additionally be submitted to the index table, which increases the amount of data to import to some extent. Index construction inevitably produces some redundant data, and the redundancy grows with the dataset size. As can be seen from Figure 9(b), for the same data volume the disk space required by ICF-HBase is only about 60% of that of Hindex, because building the index inside a column family of the data table saves the space of a separate index table.

Figure 9(a) The effect of data volume on insertion performance. Figure 9(b) The effect of data volume on index size.

(3) Query optimization strategy and result analysis on a large-scale dataset. To test the influence of the number of keywords and of range query conditions on query performance, experiments were carried out on the real dataset of 100 million records, comparing the ICF-HBase and HUAWEI Hindex methods; the results are shown in Figure 10. The overall trend is that the more keywords and query conditions there are, the longer the query takes. Comparing Figure 10(a) and 10(b), range queries take longer than keyword matching queries, because range queries involve more computation: besides equality comparisons there are also greater-than and less-than comparisons. Overall, the query time of ICF-HBase is 2 to 5 times shorter than that of Hindex, and the more keywords and query conditions there are, the larger the advantage of ICF-HBase. This is because Hindex stores a separate index table in each region, and during a query both the index table and the original data are scanned, which limits its query efficiency. The ICF-HBase method proposed in this paper builds the index as a column family stored in the same table as the original data, so a query needs only a single scan operation and the query efficiency is very high.

6. CONCLUSIONS

To solve the problem that HBase supports only rowkey-indexed queries, this paper establishes a separate column family in the original data table for storing index information and proposes an HBase secondary index method based on an isomorphic column family (ICF-HBase). On the basis of this index, data import, data query and index optimization strategies are developed. The ICF-HBase index optimization strategy established in this paper supports single-column equivalence queries, queries over logical combinations of columns, and range queries.
In the performance tests of data writing, the effect of data volume, and data query optimization, the HUAWEI Hindex and ICF-HBase methods were both evaluated on large-scale datasets. The results show that the ICF-HBase index method proposed in this paper is more efficient than the HUAWEI Hindex in index construction and data query, and its advantage in query efficiency grows as the data size increases.

Figure 10(a) The effect of the number of keywords on the query time. Figure 10(b) The effect of the number of query conditions on the query time.

However, the way the index is constructed in this paper also has disadvantages. Every write builds index entries from the indexed columns of the written data, so when other rows carry the same column value an index entry is generated again; this redundancy increases the computation of index generation and, to a certain extent, wastes storage space. Future work will address the problem of index redundancy.

ACKNOWLEDGMENTS

This work is supported by Major logistics research projects (AS214R002).

REFERENCES

Ge W., Luo S.M., Zhou W.H., Zhao D., Tang Y., Zhou J., Qu W.W., Yuan C.F., Huang Y.H. (2016). HiBase: A Hierarchical Indexing Mechanism and System for Efficient HBase Query, Chinese Journal of Computers, 39(1).
HBase (2016). Apache Software Foundation.
Huawei (2016). Hindex - Secondary Index for HBase.
Kavitha A., Rajkumar N., Victor S.P. (2014). An Integrated and Efficient Approach to Measure Semantic Similarity between Short Sentences and Paragraphs, Advances in Modelling and Analysis B, 57(2).
Mansour E.A. (2014). A Proposed Intelligent Decision Support System for Marketing Planning in Industrial Enterprises, Modelling, Measurement and Control D, 35(1).
Ngdata (2016). Lily hbase-indexer.
Janarthanan P., Rajkumar N., Padmanaban G., Yamini S. (2014). Performance Analysis on Graph Based Information Retrieval Approaches, Advances in Modelling and Analysis D, 19(1).
Taylor R.C. (2010). An Overview of the Hadoop/MapReduce/HBase Framework and Its Current Applications in Bioinformatics, BMC Bioinformatics, 11 Suppl 12, S1.
Vathi T.V., Raju G.S.N. (2015). Pattern Synthesis using Modified Differential Evolution Algorithm, Measurement and Control A, 88(1).
Wang T.C., Hu X.X., Zhong S.S., Zhang Y.J. (2016). Research on Knowledge Base System Based on UML and JqueryEasyUI, Review of Computer Engineering Studies, 3(2).
Wei R.G., Zhen J.G., Bao L.L. (2015). Study on Mining Big Users Data in the Development of Hubei Auto-Parts Enterprise, Mathematical Modelling of Engineering Problems, 2(4), 1-6.
Zhang C., Chen X.Y., Shi Z.L., Ge B. (2016). Algorithms for Spatio-temporal Queries in HBase, Journal of Chinese Computer Systems, 37(11).
Zhang N., Zheng G., Chen H., Chen J. (2014). HBaseSpatial: A Scalable Spatial Data Storage Based on HBase, IEEE International Conference on Trust, Security and Privacy in Computing and Communications.
Zhang Y., Ma Y.Z., Meng X.F. (2012). Efficient Processing of Spatial Keyword Queries on HBase, Journal of Chinese Computer Systems, 33(10).
Zou Y., Liu J., Wang S., Zha L., Xu Z. (2010). CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries, IFIP International Conference on Network and Parallel Computing, 6289.
George L. (2011). HBase: The Definitive Guide, O'Reilly Media.


More information

Evolution of Database Systems

Evolution of Database Systems Evolution of Database Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies, second

More information

Shared-network scheme of SMV and GOOSE in smart substation

Shared-network scheme of SMV and GOOSE in smart substation J. Mod. Power Syst. Clean Energy (2014) 2(4):438 443 DOI 10.1007/s40565-014-0073-z Shared-network scheme of and in smart substation Wenlong WANG, Minghui LIU (&), Xicai ZHAO, Gui YANG Abstract The network

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Column Stores and HBase. Rui LIU, Maksim Hrytsenia

Column Stores and HBase. Rui LIU, Maksim Hrytsenia Column Stores and HBase Rui LIU, Maksim Hrytsenia December 2017 Contents 1 Hadoop 2 1.1 Creation................................ 2 2 HBase 3 2.1 Column Store Database....................... 3 2.2 HBase

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

A Security Audit Module for HBase

A Security Audit Module for HBase 2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b

The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b 2nd Workshop on Advanced Research and Technology in Industry Applications (WARTIA 2016) The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b

More information

Improved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG *

Improved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG * 2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5

More information

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland Research Works to Cope with Big Data Volume and Variety Jiaheng Lu University of Helsinki, Finland Big Data: 4Vs Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html

More information

W b b 2.0. = = Data Ex E pl p o l s o io i n

W b b 2.0. = = Data Ex E pl p o l s o io i n Hypertable Doug Judd Zvents, Inc. Background Web 2.0 = Data Explosion Web 2.0 Mt. Web 2.0 Traditional Tools Don t Scale Well Designed for a single machine Typical scaling solutions ad-hoc manual/static

More information

A Data Classification Algorithm of Internet of Things Based on Neural Network

A Data Classification Algorithm of Internet of Things Based on Neural Network A Data Classification Algorithm of Internet of Things Based on Neural Network https://doi.org/10.3991/ijoe.v13i09.7587 Zhenjun Li Hunan Radio and TV University, Hunan, China 278060389@qq.com Abstract To

More information

Tacked Link List - An Improved Linked List for Advance Resource Reservation

Tacked Link List - An Improved Linked List for Advance Resource Reservation Tacked Link List - An Improved Linked List for Advance Resource Reservation Li-Bing Wu, Jing Fan, Lei Nie, Bing-Yi Liu To cite this version: Li-Bing Wu, Jing Fan, Lei Nie, Bing-Yi Liu. Tacked Link List

More information

Practical MySQL Performance Optimization. Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars

Practical MySQL Performance Optimization. Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars Practical MySQL Performance Optimization Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars In This Presentation We ll Look at how to approach Performance Optimization Discuss Practical

More information

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm Qingting Zhu 1*, Haifeng Lu 2 and Xinliang Xu 3 1 School of Computer Science and Software Engineering,

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

Personalized Search for TV Programs Based on Software Man

Personalized Search for TV Programs Based on Software Man Personalized Search for TV Programs Based on Software Man 12 Department of Computer Science, Zhengzhou College of Science &Technology Zhengzhou, China 450064 E-mail: 492590002@qq.com Bao-long Zhang 3 Department

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information

BigDataBench: a Big Data Benchmark Suite from Web Search Engines

BigDataBench: a Big Data Benchmark Suite from Web Search Engines BigDataBench: a Big Data Benchmark Suite from Web Search Engines Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Jianfeng Zhan, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon HBase vs Neo4j Technical overview Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon 12th October 2017 1 Contents 1 Introduction 3 2 Overview of HBase and Neo4j

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

Research on Load Balancing and Database Replication based on Linux

Research on Load Balancing and Database Replication based on Linux Joint International Information Technology, Mechanical and Electronic Engineering Conference (JIMEC 2016) Research on Load Balancing and Database Replication based on Linux Ou Li*, Yan Chen, Taoying Li

More information

Application of Redundant Backup Technology in Network Security

Application of Redundant Backup Technology in Network Security 2018 2nd International Conference on Systems, Computing, and Applications (SYSTCA 2018) Application of Redundant Backup Technology in Network Security Shuwen Deng1, Siping Hu*, 1, Dianhua Wang1, Limin

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information