A HBase Secondary Index Method Based on Isomorphic Column Family


Yudong Guo 1, Shenglin Li 1, Heng Zhang 2, Wei Zhong 1*

1 Military Logistics & Information Engineering Department, Logistical Engineering University, Chongqing 401331, China; 2 Hiorigin Technology Co., Ltd., Chongqing 400000, China

Abstract

The HBase database supports only the rowkey index, so an index cannot be built on a non-rowkey field. To solve this problem, this paper proposes a general HBase secondary index method based on an isomorphic column family (ICF-HBase), which constructs a key-value index in an index column family of the original data table and, on this basis, realizes single-key indexes, multi-key indexes, data import and an optimized query mechanism. Performance experiments on data writing, data volume and data query were designed on large-scale datasets. The experimental results show that, compared with the HUAWEI Hindex, the ICF-HBase index method is more efficient in data writing and query: its insertion speed is faster than Hindex, and its index space is only about 60% of that of Hindex. The ICF-HBase index method is therefore of practical significance for large-scale data query on HBase.

Keywords: HBase, secondary index, isomorphic column family, big data query

1. INTRODUCTION

With the rapid development of the Internet and information technology, the scale of data in many fields has exploded. Data is no longer small, static and isolated, but large-scale, dynamic and distributed, and retrieving useful information from massive data is a challenging task (Janarthanan et al., 2015). Because of the huge data volume and the complexity of data types, traditional database technology struggles to meet the demand for efficient storage and management, and a single-server environment cannot bear the cost of building large-scale data indexes for a search engine. The arrival of the big-data era has changed the traditional way data is stored and retrieved. HBase, a representative non-relational database with high performance, scalability, distributed column-oriented storage and real-time read/write capability, can handle large data volumes and thus makes up for the deficiencies of traditional databases. However, HBase still has limitations: it supports only rowkey-based queries, lacks support for multi-dimensional complex queries and does not support cross-row transactions. Therefore, much scientific and practical work focuses on improving the performance of multi-dimensional data query in HBase. Many studies improve HBase retrieval performance by optimizing the schema structure and the rowkey design, but these methods also introduce additional problems, such as disturbing the original region load balancing and breaking the consistency between data and index. Against this background, this paper compares and analyzes related research on HBase secondary indexes and, based on the standard "table - column family - column - value" HBase data model, proposes an HBase secondary index method based on an isomorphic column family (ICF-HBase). The design strategy is to create one or more index column families in the data table and to store the index columns and their description information as key-value pairs in those column families, keeping the index isomorphic with the original data.
Based on this structure, index-maintenance algorithms for the data writing and query paths are designed. According to commonly used query operations, three cases are optimized: single non-rowkey value query, multi-column value query and range query. The experimental results show that the ICF-HBase method proposed in this paper is beneficial for the dynamic construction of indexes and for the consistency between data and index.

In this paper, the basic theory, methods and algorithms of HBase secondary indexing are studied. The main work and contributions are as follows. The current mainstream HBase index techniques are analyzed and their advantages and disadvantages are compared. An isomorphic column family HBase index method (ICF-HBase) is proposed, which establishes a single index column family in the original data table. A key-value pattern for the index columns is designed, on top of which the data writing and query methods are implemented. For practical applications the query strategy is further optimized, covering multi-key indexes, query expression conversion and multi-interval scanning. A large-scale dataset index experiment shows that the ICF-HBase index method proposed in this paper is more effective than the HUAWEI Hindex in index construction and data query, and the advantage in query efficiency grows as the data size increases.

2. CORRELATIONAL RESEARCH

HBase, the mainstream NoSQL database, stores and retrieves data in a way similar to a hash table and provides basic data manipulation operations such as Put, Get and Scan. To write data, the HBase client packages the submitted data into a Put object and sends it to the server (HBase, 2016). Data query is achieved with the Get and Scan operations, in three main ways: querying a single specified rowkey, querying a specified range of rowkeys, and a full table scan. At present HBase does not support queries on non-key attributes; they can only be answered by a full table scan, which is far more costly than a rowkey-based query with specified conditions. The time complexity of rowkey-based retrieval is O(log N), and with a BloomFilter it can reach O(1), whereas a full table scan is O(N). For applications with record counts ranging from millions to more than a hundred million, the time cost of scanning the whole table for a non-primary-key condition is unacceptable, which makes it hard for HBase to meet the real-time query demands of many practical applications.
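
As an illustration of the full-table-scan fallback described above, the following minimal sketch (not from the paper) evaluates a condition on a non-rowkey column with a server-side filter; it assumes an HBase 2.x Java client and a hypothetical table data_table with column family cf1.

```java
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FullScanFallback {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            // Without a secondary index, a condition on a non-rowkey column can only be
            // evaluated by scanning every row and filtering server-side: O(N) in the table size.
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("cf1"), Bytes.toBytes("col2"),
                    CompareOperator.EQUAL, Bytes.toBytes("v21")));

            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```
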
Since HBase itself supports only the rowkey index, a whole row record can only be located through its rowkey. Table 1 lists the current major solutions.

Table 1 Comparison of the main solutions for HBase multi-dimensional query
1. Multiple fields used as rowkey, data written multiple times: a large amount of storage is consumed when there are many index columns.
2. Index fields combined into the rowkey: the design is complex, inflexible and poorly scalable.
3. Parallel scan with filters, without an index: consumes a lot of computing resources with poor effectiveness.

In order to improve the performance of HBase non-primary-key queries, some research work is devoted to the construction of HBase secondary indexes. For selection queries on non-primary attributes over large data (including point queries and range queries), HUAWEI developed and open-sourced the non-primary-key query system Hindex (Huawei, 2016). Instead of a single global index table, it adopts a local index model, building a separate index table for each region of the HBase table to speed up non-primary-key queries. Query requests are sent to every region server, filtered through the index table of each region, and the results are returned to the result set. Hindex queries therefore need to access all regions; when the data is concentrated in one region, this consumes a lot of search resources. Moreover, the data and the index are stored in two tables, so the data may become inconsistent with the index when data is written. In addition, region split and merge operations must recompute the split points of the original data to ensure that data and index are split into the same region. Building a separate index table therefore brings a series of problems. Lily HBase Indexer is a part of the Lily system developed by NGDATA. It uses SolrCloud to store HBase index data; when an HBase write, update or delete operation occurs, the Indexer abstracts the operation into a series of events

and, through the HBase replication mechanism, ensures the consistency of the index data written to Solr (Ngdata, 2016). The Indexer supports user-defined extraction and conversion rules for indexing HBase column data, and the Solr search results contain the user-defined column family:qualifier fields, so that the application can access the HBase column data directly. Furthermore, indexing and searching do not affect the stability or write throughput of HBase, because the index and search processes are completely separate and asynchronous. However, this indexing mechanism updates the index periodically, so its timeliness is low and it sometimes cannot satisfy real-time applications. Zhang et al. proposed HBaseSpatial, a scalable spatial data storage mechanism based on HBase; compared with MongoDB and MySQL, their experiments show that it effectively improves spatial data query speed and provides a good storage solution, but the method is not compared with other distributed index methods (Zhang et al., 2014). GE Wei et al. proposed a layered non-primary-key index storage model based on HBase and designed and implemented the hierarchical index query system HiBase (Ge et al., 2016). HiBase stores the user table and the index table on HBase and caches the hot data of the index table in memory. A memory-based index model can greatly improve the efficiency of HBase indexing, but the cache management mechanism for large-scale data is complicated and consumes a lot of memory. ZHANG Chong et al. approach indexing from the time dimension, design an index structure named HST using metadata, and propose range query and kNN query algorithms together with corresponding parallel algorithms; however, no general query strategy is given for keyword queries and range queries on dimensions other than time (Zhang et al., 2016). A research team of the Chinese Academy of Sciences proposed CCIndex, a multi-dimensional range query scheme on non-key attributes, which is deployed on distributed ordered tables to provide a high-performance, low-cost spatial index; prototype systems were implemented on HBase and Cassandra respectively (Zou et al., 2010). The main idea is to make full use of the multiple replicas of the data: a clustering index on a different non-key attribute is built in each replica, so that non-primary-key queries are turned from large random reads into sequential scans of the index based on the primary key table. CCIndex achieves a significant performance improvement for multi-dimensional non-primary-key range queries while optimizing the space overhead of the index. ZHANG Yu et al. proposed SK-HBase, a spatial keyword index structure for spatio-textual data that uses HBase as the data store and, through an effective data distribution strategy, indexes the spatial information and the textual information of spatio-textual objects at the same time. On the basis of SK-HBase, two spatial keyword query algorithms are proposed to guarantee the efficiency and scalability of spatial keyword queries over different spatial ranges. The experimental results show that the method queries spatial data efficiently and scales well (Zhang et al., 2012).

3. INTRODUCTION TO HBASE INDEX

3.1 HBase data model

A table in the HBase database stores a series of rows, and each row is composed of three basic elements: the Rowkey, the Timestamp and the Column.
The Rowkey is equivalent to the primary key in a relational database; it identifies different row records, and the system automatically sorts the rows in lexicographic order of the rowkey (George, 2011). The Timestamp labels the different versions of a record, distinguishing the data written by each operation. A Column is defined in the form <family>:<qualifier> and consists of two parts, the Family and the Qualifier. The Family, usually called a column family, is the unit of physical storage; frequently co-accessed data is generally placed in the same column family to improve access efficiency, and the number of column families is defined when the table is created. The Qualifier identifies a column within a column family; its number is unlimited and can grow dynamically. In a traditional relational database the storage location of a value is determined by the column name alone, while in HBase both the Family and the Qualifier must be known to determine the column in which data is stored. To construct an HBase index, four pieces of information must be included: the rowkey, the column family name, the column qualifier and the timestamp.
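
A minimal sketch of this data model in code, assuming an HBase Java client and a hypothetical table data_table with column family cf1: a cell is addressed by rowkey, <family>:<qualifier> and timestamp.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster described by hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            // A cell is addressed by rowkey + <family>:<qualifier> + timestamp;
            // here the timestamp is assigned by the server since it is not set explicitly.
            Put put = new Put(Bytes.toBytes("RK1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("v11"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col2"), Bytes.toBytes("v21"));
            table.put(put);

            // Reading the row back: only the rowkey is indexed natively.
            Get get = new Get(Bytes.toBytes("RK1"));
            get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col2"));
            Result r = table.get(get);
            System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col2"))));
        }
    }
}
```
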

Only the combination of rowkey, <Family>:<Qualifier> and timestamp determines a data cell in a table. Data in HBase has no type and is stored in binary format. Each row has a sortable primary key and an arbitrary number of columns, and a row may have no value in some columns. All rows share the same column families but may contain different columns. HBase therefore differs greatly from a traditional database: it is column-oriented, there is no relationship between tables, each table is independent, and there are no foreign key constraints (Taylor, 2010).

3.2 HBase index

Establishing a column index in HBase is straightforward, because HBase itself already provides the index from the rowkey to the whole row; the mapping we need to add goes from the value of a column back to the rowkey. The core idea of index construction is to establish exactly this mapping. As shown in Figure 1, a data table has one column family (cf1) with two columns, cf1:col1 and cf1:col2. An index is first built on cf1:col2; for a query with a condition on cf1:col2, the matching rowkeys are looked up in the index table, and those rowkeys are finally used to retrieve the complete records from the original table.

Figure 1 Schematic diagram of a column index.
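
The mapping of Figure 1 can be sketched with a plain HBase client as follows. This is only an illustration of the value-to-rowkey idea using a separate index table; the table and column names, the '_' separator and the key layout are assumptions, and the ICF-HBase method described next stores the index in a column family of the data table instead.

```java
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnIndexSketch {
    // Maintain the value-to-rowkey mapping of Figure 1 in a separate index table:
    // the index rowkey is the indexed value plus the original rowkey,
    // and the cell stores the original rowkey.
    static void indexPut(Table dataTable, Table indexTable,
                         String rowkey, String col2Value) throws Exception {
        Put data = new Put(Bytes.toBytes(rowkey));
        data.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col2"), Bytes.toBytes(col2Value));
        dataTable.put(data);

        Put idx = new Put(Bytes.toBytes(col2Value + "_" + rowkey)); // value + rowkey keeps index keys unique
        idx.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("rk"), Bytes.toBytes(rowkey));
        indexTable.put(idx);
    }

    // Query by column value: scan the index table by prefix, then Get the original rows.
    static void queryByValue(Table dataTable, Table indexTable, String value) throws Exception {
        Scan scan = new Scan().withStartRow(Bytes.toBytes(value + "_"))
                              .withStopRow(Bytes.toBytes(value + "`")); // '`' directly follows '_' in ASCII
        try (ResultScanner rs = indexTable.getScanner(scan)) {
            for (Result r : rs) {
                byte[] rk = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("rk"));
                Result row = dataTable.get(new Get(rk));
                System.out.println(Bytes.toStringBinary(row.getRow()));
            }
        }
    }
}
```
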

4. ISOMORPHIC COLUMN FAMILY INDEX METHOD

4.1 Overall design

Figure 2 displays the overall client/server architecture of the HBase secondary index. In HBase, the region is the basic unit of data storage and management. A table can contain one or more regions, each region is served by exactly one RegionServer (RS), and an RS can serve multiple regions at the same time; the regions hosted on different RSs together form the overall logical view of a table.

Figure 2 Distributed, concurrent data indexing and query process.

(1) Client query process
When querying data, the client sends the request to every region of the table; each region performs the retrieval on its RegionServer and returns its partial results to the client, which then merges and sorts the results obtained from the multiple RegionServers to produce the final query result.
(2) Index building process
When building an index, the client sends the data write request to one region; according to the index description, that region builds the index on its RegionServer and stores it there, and the result of the operation is returned to the client.
Whether querying or indexing, the process is distributed and concurrent.

4.2 Index design

To explain the index design clearly, we continue the example of the previous section, in which the data of a table is distributed over two regions. When new data is written to the database, the system detects that an index has been defined on column col2 and writes the corresponding index data into the cf_idx column family. As shown in Figure 3, a new column family named cf_idx is added to the original data table to store the index information; each region then stores the index of its own original data inside the region itself, using a single reserved column named col_idx, which users are not allowed to use for their own data.

Figure 3 Schematic diagram of the HBase secondary index (index definition IDX_name => cf1:col2, with index keys such as 1.StartKey+IDX_name+v21+RK1 stored in the cf_idx family).

(1) Generating rules for index key

The index key is what keeps the data and the index isomorphic. To make non-primary-key queries as efficient as possible, this paper proposes the following key generation rule for ICF-HBase. The serialized index key is

key = N.StartKey + IDX_name + value + rowkey    (1)

where N.StartKey is the start key of the region holding the data, IDX_name is an extensible set of indexed columns of the form cf_m:col_n, and value is the corresponding extensible string of column values v_ij (i = 1..m, j = 1..n). For an indexed column, the index is logically a {key: value} pair, and the key consists of four parts:
Region information. To ensure that the index and the original data are stored in the same interval (the same region), the key must be prefixed with the region startkey;
Index information. The index name IDX_name is appended to identify which column index the entry belongs to;

Isomorphic column value information. The column value corresponding to the index name is appended after the index name, identifying the indexed column and its value;
Isomorphic rowkey information. The rowkey is stored after the column value, identifying the row to which the column value belongs.
So that the information stored in the index key can be parsed back, the key is also serialized into the cell value of the index entry. After this organization, any query on an indexed column can be transformed into a scan restricted to the cf_idx column family; since startrow and endrow can be specified, the cost of the query is greatly reduced.

(2) Index construction method

According to the complexity of the query, indexes are built in two ways: single-key indexes and multi-key indexes. Figures 4(a) and 4(b) respectively show how a single-key index and a multi-key index are built.

Figure 4 Construction of a single-key index and a multi-key index.

According to equation (1), a combined multi-key index key is obtained from the single-key index by appending the additional column names to IDX_name and the additional column values to value (for example StartKey + IDX_name + v11 + v21 + RK1). Each indexed column must be mapped to a data type; the paper covers the most common data types and query methods, as shown in Table 2. Text-type data uses fuzzy query (Kavitha et al., 2014).

Table 2 Index types and query methods
Index type   Length      Query methods
Char         2           equivalence, range query
Byte         1           equivalence, range query
Short        2           equivalence, range query
Int          4           equivalence, range query
Long         8           equivalence, range query
Float        4           equivalence, range query
Double       8           equivalence, range query
String       Unlimited   exact match, range query
Text         Unlimited   fuzzy query
Multi index  Unlimited   index optimization
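
The serialization of equation (1) can be sketched as below. The byte layout (no separators, plain string concatenation) is an assumption, since the paper only fixes the concatenation order of StartKey, IDX_name, value and rowkey.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class IndexKeyBuilder {
    /**
     * Serializes an ICF-HBase index key following equation (1):
     *   key = region startKey + IDX_name + value + rowkey
     */
    static byte[] buildIndexKey(byte[] regionStartKey, String idxName,
                                byte[] columnValue, byte[] rowkey) {
        byte[] key = regionStartKey;
        key = Bytes.add(key, Bytes.toBytes(idxName));
        key = Bytes.add(key, columnValue);
        key = Bytes.add(key, rowkey);
        return key;
    }

    public static void main(String[] args) {
        // Example from Figure 3: index IDX_name on cf1:col2, value v21, row RK1,
        // with "1." standing in for the start key of region 1 (an assumption).
        byte[] key = buildIndexKey(Bytes.toBytes("1."), "IDX_name",
                                   Bytes.toBytes("v21"), Bytes.toBytes("RK1"));
        System.out.println(Bytes.toStringBinary(key)); // 1.IDX_namev21RK1
    }
}
```
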

4.3 Data import

After the index is constructed, the data write path changes slightly. The basic process is as follows: when the client submits a Put object, the system first determines which columns the Put contains, then determines from the index description which of those columns are indexed, constructs the corresponding index data as additional Put entries, and finally submits them together with the original Put. This design simplifies data import: because the data and its index are submitted in the same batch, they either all succeed or all fail, so no inconsistency between data and index can appear. The data import process is shown in Figure 5.

Figure 5 Data import process.

Algorithm 1 describes in detail how, at data write time, an index entry is built according to the key generation rule, added to the list of Put objects, and finally submitted together with the data. The algorithm is as follows:

Algorithm 1 ICF-HBase data write and index construction algorithm
Input: client Put object P(RK, C, V), where RK is the rowkey, C is the set of specified columns in the column family, C = {cf:col_1, ..., cf:col_i, ..., cf:col_n}, and V is the set of values of the columns in C, V = {value_1, ..., value_i, ..., value_n}
Output: transaction operation result TransResult
1.  IndexInfo <- loadIndexInfo(C)
2.  ZRange <- calculateRange(C, IndexInfo)
3.  for info_i in IndexInfo do
4.      key <- ZRange.startKey + info_i.name + cf:col_i + value_i
5.      value <- serializeIndexInfo(key)
6.      {cf_idx: col_idx} <- generateIdxOne(key, value)
7.      PutList.add({cf_idx: col_idx})
8.  end for
9.  postToMemStore(PutList)
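
The idea behind Algorithm 1 can be approximated on the client side as follows: the data Put and the corresponding index Put (written to the cf_idx family of the same table, with a rowkey built by equation (1)) are submitted in one batch. In the paper this logic runs inside the RegionServer and is posted directly to the MemStore; the table name, column names and region start key used here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class IcfWriteSketch {
    public static void main(String[] args) throws Exception {
        byte[] CF = Bytes.toBytes("cf1");
        byte[] CF_IDX = Bytes.toBytes("cf_idx");
        byte[] COL_IDX = Bytes.toBytes("col_idx");

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            List<Put> batch = new ArrayList<>();

            // Original data row.
            Put data = new Put(Bytes.toBytes("RK1"));
            data.addColumn(CF, Bytes.toBytes("col2"), Bytes.toBytes("v21"));
            batch.add(data);

            // Index row in the same table, written to the cf_idx family.
            // Its rowkey follows equation (1); the cell value holds the original rowkey (simplified).
            byte[] idxKey = Bytes.add(Bytes.toBytes("1."),                 // region startKey (assumed)
                            Bytes.add(Bytes.toBytes("IDX_name"),
                            Bytes.add(Bytes.toBytes("v21"), Bytes.toBytes("RK1"))));
            Put index = new Put(idxKey);
            index.addColumn(CF_IDX, COL_IDX, Bytes.toBytes("RK1"));
            batch.add(index);

            // Data and index are submitted together, mirroring the PutList of Algorithm 1.
            // Note: a client-side batch is not atomic across rows; the paper does this server-side.
            table.put(batch);
        }
    }
}
```
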

4.4 Data query

The data read path is slightly more complex than the write path and requires three steps. The first step is to create a scanner and specify a condition for reading data; to this end, this paper extends HBase's own Scan object so that it supports set operations over conditions. When a region server receives a client query request, it builds a search tree according to the condition: the logical AND, OR and NOT relations in the condition are each converted into a node of the search tree, and the whole tree is evaluated over the index column family. In the second step, each leaf node queries its data range through the HBase StoreScanner with the designated startrow and endrow, the partial results are merged through the logical operations in the search tree, and the rowkeys of the satisfying result set are obtained; to improve efficiency, a batch size can be set for batch operation. In the third step, a seek operation is performed in the original data column family according to the rowkey result set, the rows corresponding to those rowkeys are read, and the dataset is returned to the client. The data query process is shown in Figure 6.

Figure 6 Data query process.

Algorithm 2 describes the query process in detail: the search tree is constructed, the positions of the seek operations are determined, the rowkey list is retrieved from the index, and finally the data is fetched from the original data family. The algorithm is as follows:

Algorithm 2 ICF-HBase parallel index data query algorithm
Input: the client query field list colList and the query condition object Condition, where colList = {cf:col_1, ..., cf:col_i, ..., cf:col_n}, Condition = {con_1, ..., con_i, ..., con_n, Relation}, con_i = cf:col_i : flag_i : value_i with flag_i in {=, ≠, <, ≤, >, ≥} representing the comparison between a column and a column value, and Relation = {rel_12, ..., rel_ij, ..., rel_mn} representing the logical relationship between the conditions
Output: query data result set ResultSet
1.  Scanner scanner = new Scanner()
2.  conList <- getConList(Condition)
3.  for con_i = cf:col_i : flag_i : value_i in conList do
4.      position <- getPosition(con_i)
5.      IDXList_i = List[IDX_key, IDX_value] <- scanner.seek(position)
6.  end for
7.  for rel_ij in Relation do
8.      IDXList_ij <- calculateRelationIDXList(rel_ij)
9.      IDXList_all.add(IDXList_ij)
10. end for
11. rowkeyList <- generateRowkeyList(IDXList_all)
12. ResultSet <- scanner.seek(rowkeyList)
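
A client-side approximation of the query path of Algorithm 2 for a single equality condition (col2 = v21): the condition becomes a bounded scan over the cf_idx family, the rowkeys are recovered from the index entries, and the original rows are then fetched. The paper performs these steps inside the region server with StoreScanners and merges several condition subtrees; the key layout, table name and stop-row construction here are assumptions, and the rowkey is read from the index cell value as in the write sketch above (simplified; the paper serializes the full key there and parses the rowkey out).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class IcfQuerySketch {
    public static void main(String[] args) throws Exception {
        byte[] CF_IDX = Bytes.toBytes("cf_idx");
        byte[] COL_IDX = Bytes.toBytes("col_idx");

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("data_table"))) {

            // Equality condition col2 = v21 becomes a bounded scan over the index family:
            // startrow = startKey + IDX_name + v21, stoprow = the same prefix plus a high byte
            // (a simplification; any key with this prefix falls inside the range).
            byte[] prefix = Bytes.add(Bytes.toBytes("1."),
                            Bytes.add(Bytes.toBytes("IDX_name"), Bytes.toBytes("v21")));
            Scan idxScan = new Scan()
                    .withStartRow(prefix)
                    .withStopRow(Bytes.add(prefix, new byte[]{(byte) 0xFF}))
                    .addFamily(CF_IDX);

            List<Get> gets = new ArrayList<>();
            try (ResultScanner rs = table.getScanner(idxScan)) {
                for (Result r : rs) {
                    byte[] rowkey = r.getValue(CF_IDX, COL_IDX); // original rowkey stored in the index cell
                    gets.add(new Get(rowkey).addFamily(Bytes.toBytes("cf1")));
                }
            }
            // Fetch the matching rows from the data family.
            for (Result row : table.get(gets)) {
                System.out.println(Bytes.toStringBinary(row.getRow()));
            }
        }
    }
}
```
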

4.5 Query optimization strategy

(1) Establish a composite index. Many enterprises have built decision support systems (Mansour, 2014), often modelled with UML during system design (Wang et al., 2016), and the business requirements usually involve queries over a logical combination of columns, for example mining the relationship between large-scale user data and various business processes (Wei et al., 2015). Establishing a composite index over two such columns can improve query efficiency. Time is a special dimension: many business-level queries are restricted to a certain time range, so by default both a single index on time and a composite index containing time are created. When the user queries on a column together with a specified time range, the single and composite time indexes allow all matching rowkeys to be retrieved with only one scan operation, which greatly improves efficiency.

(2) Transformation of query expressions. As described in the previous section, the logical AND, OR and NOT relations are the important nodes of the search tree. Among them, logical AND is often the most expensive: if an AND is applied to two large datasets, fully traversing even one of them costs a lot of time and memory, whereas OR and NOT operations tend to be more efficient. Therefore, this paper converts the client's query expression so that logical AND operations are handled first and the OR and NOT operations are performed afterwards; the AND operations can then be answered optimally through a composite index built over the columns involved. For example, the query expression (A || B) && (C || D) is rewritten as (A && C) || (A && D) || (B && C) || (B && D), as shown in the sketch after this section.

(3) Multi-interval scan query. Many business scenarios need multi-range queries such as a_1 < A < a_2 && b_1 < B < b_2. The traditional approach is to obtain the rowkey set RK_List1 of column A in the interval (a_1, a_2) and the rowkey set RK_List2 of column B in the interval (b_1, b_2), and then take the intersection of RK_List1 and RK_List2. This brings problems: if either set is too large, the query takes very long, and the in-memory intersection risks memory overflow. Therefore, this paper builds a joint index over A and B to support multi-range queries, and a differential optimization algorithm is used to compute the globally optimal solution set and optimize the seek operations (Vathi and Raju, 2015). The optimized seek process is shown in Figure 7.

Figure 7 Optimizing the seek process.
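
The rewrite in item (2) can be sketched as a simple distribution of conjunctions over the disjuncts; conditions are represented as strings purely for illustration, and each resulting conjunction would then be answered by one seek on the corresponding composite index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConditionRewrite {
    // Distributes AND over OR: (A || B) && (C || D)
    // becomes (A && C) || (A && D) || (B && C) || (B && D).
    static List<List<String>> distribute(List<String> left, List<String> right) {
        List<List<String>> conjunctions = new ArrayList<>();
        for (String a : left) {
            for (String b : right) {
                conjunctions.add(Arrays.asList(a, b));
            }
        }
        return conjunctions;
    }

    public static void main(String[] args) {
        System.out.println(distribute(Arrays.asList("A", "B"), Arrays.asList("C", "D")));
        // prints [[A, C], [A, D], [B, C], [B, D]]
    }
}
```
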

5. EXPERIMENTAL ANALYSIS

This section verifies the efficiency of the isomorphic column family HBase secondary index method through a series of experiments and compares the structure proposed in this paper with the HUAWEI Hindex, demonstrating the efficiency and scalability of the ICF-HBase index.

5.1 Experimental environment and dataset

To test the performance of the proposed ICF-HBase, the experiments were run on a Hadoop cluster of 10 nodes (1 master and 9 slaves). The node configuration of the cluster is shown in Table 3.

Table 3 Computer node configuration information
Name               Configuration
CPU                Intel(R) Core(TM) i7-4510U CPU 2.00 GHz * 4
Memory             8 GB
Disk               1 TB 7200 RPM SATA II
OS                 UbuntuKylin Linux
JVM Version        Java
Hadoop Version     Hadoop
HBase Version      HBase
ZooKeeper Version  ZooKeeper

The test dataset of Brown University and some real business data were used; to facilitate quantitative analysis, the data is divided into 10 parts, as shown in Table 4.

Table 4 Basic information of the datasets
Dataset       Record number   Size (MB)
Dataset1      100,000
Dataset2      200,000
Dataset3      300,000
Dataset4      400,000
Dataset5      500,000
Dataset6      1,000,000
Dataset7      1,500,000
Dataset8      2,000,000
Dataset9      2,500,000
Real Dataset  100,000,000     10,

5.2 Experimental results and analysis

(1) Performance comparison and result analysis of data import. 10,000,000 business records were used for this test; the objects under test were the HUAWEI Hindex and the improved ICF-HBase index method of this paper, and the number of records written was sampled every 5 seconds. The experimental results are shown in Figure 8.

Figure 8 Comparison of data writing efficiency.

The results show that, compared with the HUAWEI Hindex, the ICF-HBase index method proposed in this paper writes data and builds the index more efficiently: on average Hindex writes 4835 records every 5 seconds, while the ICF-HBase index method writes more records in the same interval. This is because during writes the original data submitted by ICF-HBase and its index data remain isomorphic and are stored in the same table, whereas Hindex splits data and index into two parts, which increases the amount of computation and transaction work.

(2) The effect of data volume on data writing and index construction. In this experiment, the insertion rate and index space of ICF-HBase were measured under different data volumes and compared with standard HBase and the HUAWEI Hindex; the results are presented in Figure 9. Figure 9(a) shows that, across the different data volumes, the insertion speed of the ICF-HBase index is 30%-40% faster than Hindex; moreover, as the data volume grows, the insertion speed of Hindex drops noticeably while that of ICF-HBase remains stable. Both are lower than the write speed of standard HBase without an index. This is because in ICF-HBase the original data and the index data are packaged into the Put objects of the same submission, while in Hindex the index data must additionally be submitted to the index table, which increases the amount of data to import to some extent. Index construction inevitably produces some redundant data, and the redundancy grows with the dataset size. As can be seen from Figure 9(b), for the same data volume the disk space required by ICF-HBase is only about 60% of that of Hindex, because building the index inside a column family of the data table saves the space of a separate index table.

Figure 9(a) The effect of data volume on insertion performance. Figure 9(b) The effect of data volume on index size.

(3) Query optimization strategy and result analysis on a large-scale dataset. To test the influence of the number of keywords and of range query conditions on query performance, experiments were carried out on the real dataset of 100 million records, comparing the ICF-HBase and HUAWEI Hindex methods; the results are shown in Figure 10. The overall trend is that the more keywords and query conditions there are, the longer the query takes. Comparing Figure 10(a) and 10(b), range queries take longer than keyword matching queries, because range queries involve more computation: besides equality comparisons there are also greater-than and less-than comparisons. Overall, the query time of ICF-HBase is 2 to 5 times shorter than that of Hindex, and the more keywords and query conditions there are, the larger the advantage of ICF-HBase. This is because Hindex stores a separate index table in each region, and during a query both the index table and the original data are scanned, which limits its query efficiency. The ICF-HBase method proposed in this paper builds the index as a column family stored in the same table as the original data, so a query needs only a single scan operation and the query efficiency is very high.

6. CONCLUSIONS

To solve the problem that HBase supports only rowkey-indexed queries, this paper establishes a separate column family in the original data table for storing index information and proposes an HBase secondary index method based on an isomorphic column family (ICF-HBase). On the basis of this index, data import, data query and index optimization strategies are developed. The ICF-HBase index optimization strategy established in this paper supports single-column equivalence queries, queries over logical combinations of columns, and range queries.
In the performance tests of data writing, the effect of data volume, and data query optimization, the HUAWEI Hindex and ICF-HBase methods were both evaluated on large-scale datasets. The results show that the ICF-HBase index method proposed in this paper is more efficient than the HUAWEI Hindex in index construction and data query, and its advantage in query efficiency grows as the data size increases.

Figure 10(a) The effect of the number of keywords on the query time. Figure 10(b) The effect of the number of query conditions on the query time.

However, the way the index is constructed in this paper also has disadvantages. Every write builds index entries from the indexed columns of the written data, so when other rows carry the same column value an index entry is generated again; this redundancy increases the computation of index generation and, to a certain extent, wastes storage space. Future work will address the problem of index redundancy.

ACKNOWLEDGMENTS

This work is supported by Major logistics research projects (AS214R002).

REFERENCES

Ge W., Luo S.M., Zhou W.H., Zhao D., Tang Y., Zhou J., Qu W.W., Yuan C.F., Huang Y.H. (2016). HiBase: A Hierarchical Indexing Mechanism and System for Efficient HBase Query, Chinese Journal of Computers, 39(1).
HBase (2016). Apache Software Foundation.
Huawei (2016). Hindex - Secondary Index for HBase.
Kavitha A., Rajkumar N., Victor S.P. (2014). An Integrated and Efficient Approach to Measure Semantic Similarity between Short Sentences and Paragraphs, Advances in Modelling and Analysis B, 57(2).
Mansour E.A. (2014). A Proposed Intelligent Decision Support System for Marketing Planning in Industrial Enterprises, Modelling, Measurement and Control D, 35(1).
Ngdata (2016). Lily hbase-indexer.
Janarthanan P., Rajkumar N., Padmanaban G., Yamini S. (2014). Performance Analysis on Graph Based Information Retrieval Approaches, Advances in Modelling and Analysis D, 19(1).
Taylor R.C. (2010). An Overview of the Hadoop/MapReduce/HBase Framework and Its Current Applications in Bioinformatics, BMC Bioinformatics, 11 Suppl 12, S1.
Vathi T.V., Raju G.S.N. (2015). Pattern Synthesis using Modified Differential Evolution Algorithm, Measurement and Control A, 88(1).
Wang T.C., Hu X.X., Zhong S.S., Zhang Y.J. (2016). Research on Knowledge Base System Based on UML and JqueryEasyUI, Review of Computer Engineering Studies, 3(2).
Wei R.G., Zhen J.G., Bao L.L. (2015). Study on Mining Big Users Data in the Development of Hubei Auto-Parts Enterprise, Mathematical Modelling of Engineering Problems, 2(4), 1-6.
Zhang C., Chen X.Y., Shi Z.L., Ge B. (2016). Algorithms for Spatio-temporal Queries in HBase, Journal of Chinese Computer Systems, 37(11).
Zhang N., Zheng G., Chen H., Chen J. (2014). HBaseSpatial: A Scalable Spatial Data Storage Based on HBase, IEEE International Conference on Trust, Security and Privacy in Computing and Communications.
Zhang Y., Ma Y.Z., Meng X.F. (2012). Efficient Processing of Spatial Keyword Queries on HBase, Journal of Chinese Computer Systems, 33(10).
Zou Y., Liu J., Wang S., Zha L., Xu Z. (2010). CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries, IFIP International Conference on Network and Parallel Computing, 6289.
George L. (2011). HBase: The Definitive Guide, O'Reilly Media.


More information

Evolution of Database Systems

Evolution of Database Systems Evolution of Database Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies, second

More information

Shared-network scheme of SMV and GOOSE in smart substation

Shared-network scheme of SMV and GOOSE in smart substation J. Mod. Power Syst. Clean Energy (2014) 2(4):438 443 DOI 10.1007/s40565-014-0073-z Shared-network scheme of and in smart substation Wenlong WANG, Minghui LIU (&), Xicai ZHAO, Gui YANG Abstract The network

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Column Stores and HBase. Rui LIU, Maksim Hrytsenia

Column Stores and HBase. Rui LIU, Maksim Hrytsenia Column Stores and HBase Rui LIU, Maksim Hrytsenia December 2017 Contents 1 Hadoop 2 1.1 Creation................................ 2 2 HBase 3 2.1 Column Store Database....................... 3 2.2 HBase

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

A Security Audit Module for HBase

A Security Audit Module for HBase 2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b

The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b 2nd Workshop on Advanced Research and Technology in Industry Applications (WARTIA 2016) The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b

More information

Improved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG *

Improved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG * 2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5

More information

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland Research Works to Cope with Big Data Volume and Variety Jiaheng Lu University of Helsinki, Finland Big Data: 4Vs Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html

More information

W b b 2.0. = = Data Ex E pl p o l s o io i n

W b b 2.0. = = Data Ex E pl p o l s o io i n Hypertable Doug Judd Zvents, Inc. Background Web 2.0 = Data Explosion Web 2.0 Mt. Web 2.0 Traditional Tools Don t Scale Well Designed for a single machine Typical scaling solutions ad-hoc manual/static

More information

A Data Classification Algorithm of Internet of Things Based on Neural Network

A Data Classification Algorithm of Internet of Things Based on Neural Network A Data Classification Algorithm of Internet of Things Based on Neural Network https://doi.org/10.3991/ijoe.v13i09.7587 Zhenjun Li Hunan Radio and TV University, Hunan, China 278060389@qq.com Abstract To

More information

Tacked Link List - An Improved Linked List for Advance Resource Reservation

Tacked Link List - An Improved Linked List for Advance Resource Reservation Tacked Link List - An Improved Linked List for Advance Resource Reservation Li-Bing Wu, Jing Fan, Lei Nie, Bing-Yi Liu To cite this version: Li-Bing Wu, Jing Fan, Lei Nie, Bing-Yi Liu. Tacked Link List

More information

Practical MySQL Performance Optimization. Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars

Practical MySQL Performance Optimization. Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars Practical MySQL Performance Optimization Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars In This Presentation We ll Look at how to approach Performance Optimization Discuss Practical

More information

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm Qingting Zhu 1*, Haifeng Lu 2 and Xinliang Xu 3 1 School of Computer Science and Software Engineering,

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

Personalized Search for TV Programs Based on Software Man

Personalized Search for TV Programs Based on Software Man Personalized Search for TV Programs Based on Software Man 12 Department of Computer Science, Zhengzhou College of Science &Technology Zhengzhou, China 450064 E-mail: 492590002@qq.com Bao-long Zhang 3 Department

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information

BigDataBench: a Big Data Benchmark Suite from Web Search Engines

BigDataBench: a Big Data Benchmark Suite from Web Search Engines BigDataBench: a Big Data Benchmark Suite from Web Search Engines Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Jianfeng Zhan, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon HBase vs Neo4j Technical overview Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon 12th October 2017 1 Contents 1 Introduction 3 2 Overview of HBase and Neo4j

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

Research on Load Balancing and Database Replication based on Linux

Research on Load Balancing and Database Replication based on Linux Joint International Information Technology, Mechanical and Electronic Engineering Conference (JIMEC 2016) Research on Load Balancing and Database Replication based on Linux Ou Li*, Yan Chen, Taoying Li

More information

Application of Redundant Backup Technology in Network Security

Application of Redundant Backup Technology in Network Security 2018 2nd International Conference on Systems, Computing, and Applications (SYSTCA 2018) Application of Redundant Backup Technology in Network Security Shuwen Deng1, Siping Hu*, 1, Dianhua Wang1, Limin

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information