S4STRD: A Scalable in Memory Storage System for Spatio-Temporal Real-time Data


Tran Vu Pham, Duc Hai Nguyen and Khue Doan
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Contact: ptvu@hcmut.edu.vn

Abstract: The popularity of applications that use spatio-temporal data in real-time has brought about the need for efficient storage systems. Different systems exist for storing such data: traditional relational database management systems, NoSQL databases, RAM-based systems, and Hadoop/MapReduce-based systems. However, due to the special characteristics of spatio-temporal data used in real-time applications, these systems do not match the applications' performance requirements well. This paper introduces a distributed RAM-based storage system that works in combination with a NoSQL database to provide better performance for real-time applications that use huge volumes of spatio-temporal data. Experimental results show that the proposed system outperforms disk-based NoSQL databases and scales well as the volume of data increases.

Keywords: spatio-temporal; real-time; big data; RAM storage

I. INTRODUCTION

In recent years, spatio-temporal data has become popular and useful. In wireless sensor network applications, such as tracking moving objects or monitoring the condition of constructions, temporal and spatial information is frequently embedded into the data packets sent from sensor nodes to the data center [1] [2]. In transportation, GPS signals are used in many applications in the field of Intelligent Transportation Systems (ITS), such as traffic flow prediction [3], real-time vehicle routing [4], and railway mapping [5]. Apparently, these applications need to employ real-time information to improve user experience as well as to detect problems in time. However, spatio-temporal datasets are huge, and data arrives at a high rate from various sources in real-time.
Hence, providing a good storage mechanism for spatio-temporal data to support such applications is a challenging issue. A storage system for real-time data arriving at a very fast rate usually needs to deal with two issues: (i) storing a huge volume of data efficiently in real-time, and (ii) getting (querying) the data out of the storage with very low latency. Traditional relational database management systems (RDBMS), such as Oracle and MySQL, are currently very popular. However, due to their complex mechanisms for storing, indexing, and querying data, the performance of the relational model is not well suited to very large spatio-temporal, real-time datasets. Emerging NoSQL databases [6], such as RAMCloud [7] and HBase [8], allow multiple data formats by using simple data models such as key-value or document. They also have advantages in performance and scalability, obtained by sacrificing some consistency constraints or applying redundancy. With respect to storing and processing spatio-temporal data in real-time, NoSQL databases are more applicable than traditional RDBMS. However, the format-free data models supported by NoSQL databases commonly have only one primary index. Therefore, when search queries involve multiple attributes, which is common with spatio-temporal datasets where time and geographical attributes such as latitude and longitude are always part of a query, simple key-value lookup may not be efficient. Composite keys may be used to improve the performance of NoSQL databases [9]. Hadoop/MapReduce has emerged as a solution for massive processing of large datasets. Many MapReduce-based systems, such as those described in [10] [11], have been developed for mining large spatio-temporal datasets. However, the Hadoop framework is not appropriate for real-time applications, as its primary objective is throughput, not latency.
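To illustrate the composite-key idea mentioned above, one simple scheme (purely illustrative; the encoding in [9] is more elaborate) is to concatenate a zero-padded timestamp with spatial-cell and device identifiers, so that lexicographic key order matches time order in a single-index key-value store:

```python
def composite_key(written_ts, cell_x, cell_y, device_code):
    """Build a composite row key for a key-value store.

    A zero-padded timestamp prefix keeps records clustered by time;
    the spatial cell and device code act as tie-breakers. All field
    widths here are assumptions for the sake of the example."""
    return f"{written_ts:013d}:{cell_x:04d}:{cell_y:04d}:{device_code:08d}"

# Lexicographic order of the keys follows write-time order.
k_early = composite_key(2, 9, 9, 99)
k_late = composite_key(10, 0, 0, 1)
```

Because all components are fixed-width, a range scan over a time interval becomes a contiguous scan over the primary key, which is the property such encodings exploit.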
In recent years, RAM has been widely used to keep long-term data in data-intensive applications due to its superior performance over disks or flash storage [7]. As a result, many in-memory stores have been developed, e.g., RAMCloud [7], H-Store [12], and memcached [13]. The emergence of these storage systems brings a new way of storing and processing large datasets for real-time applications. Motivated by this research, in this article we introduce a new scalable in-memory storage system for spatio-temporal, real-time data. The new storage system includes an indexing mechanism optimized for storing data in memory, offering very high write throughput and low read latency. In the rest of this paper, we illustrate some key characteristics of spatio-temporal datasets (Section II). We then describe the basic idea and our implementation (Sections III and IV). Experimental results are reported and analyzed to evaluate the system performance (Section V). Finally, we discuss studies related to our work and conclude (Sections VI and VII).

II. MOTIVATING APPLICATION

We are currently running a Traffic Information System that collects GPS signals generated by GPS-embedded devices on vehicles in Ho Chi Minh City to produce real-time traffic conditions for the city. As GPS signals are a typical type of spatio-temporal data, a storage system architecture that works for GPS data can also be generalized to other types of spatio-temporal data.

Fig. 1. Spatial distribution of GPS data in Ho Chi Minh City.

GPS data is often collected from different types of devices, such as smartphones and embedded GPS devices on buses, taxis, etc. Signals generated by different types of devices have different data formats, fields, and sizes; the largest record is about 100 bytes. Every day, the system receives approximately 10 million GPS records from about 4200 buses in the city. Fig. 1 plots the distribution of GPS signals that the system received on four different days in March 2015 over the city area. The distribution is imbalanced: data density at the city center is extremely high while that in suburban areas is very low, because the majority of vehicles are in urban areas. However, the spatial distribution looks almost the same every day, since the transportation infrastructure and user behavior rarely change. Therefore, we can safely assume that the spatial distribution of spatio-temporal data is stable. In summary, the GPS signal stream that the Traffic Information System currently receives and processes in real-time has the following characteristics: (i) heterogeneous in format and size, (ii) huge in volume, and (iii) stable in space and time distribution.

III. DESIGN PRINCIPLES

A. Data organization

To store the huge volume of incoming data and to allow applications to process the data in real-time (with strictly limited latency), main memory (commonly known as RAM) is used as temporary storage for the most recent data.
The advantage of this approach is that data placed in RAM can be accessed much faster than data placed on secondary devices. In addition, RAM allows data to be accessed randomly without degrading performance, so we can freely write data without worrying about locality issues. However, as RAM is limited in size and sensitive to power failures and crashes, secondary storage (commonly hard disks) is also used to store the data persistently. For each incoming record, two copies are generated: one is stored in RAM and the other on secondary storage. The contents of the two copies differ, however. The hard disks store the original version, which contains all the information of the incoming signal, while only several chosen data fields are stored in RAM. The reason for this strategy is that some fields of spatio-temporal data are accessed much more often than others, and keeping rarely accessed fields in RAM is wasteful, as RAM space is more expensive. The dataset stored on hard disks is designed for backup purposes only. As the most recent data is stored in RAM, all data manipulation operations are executed on the in-RAM dataset; this helps achieve the desired performance. Old data that is not frequently accessed is removed from RAM to make space for incoming records. A NoSQL database is used for storing data on secondary storage. Since the on-disk dataset is managed by an independent database, we explain our solution for the in-memory storage system only.

B. Data partitioning

A set of distributed servers is used to handle the huge volume of data. Due to the stable distribution of data over space, as discussed in Section II, geographical coordinates are used as the criteria for partitioning the data among servers. We divide the space into a grid of n × n uniform square cells.
Each cell has its own coordinates (x, y), where y is the number of cells in the same row on its left side and x is the number of cells in the same column below it. Hence, given a record's spatial coordinates (latitude, longitude), the length l of a cell's side, and the position of the origin (x₀, y₀), we can easily determine which cell the record falls in using the following formulas:

x = ⌊(latitude − x₀) / l⌋ (1)
y = ⌊(longitude − y₀) / l⌋ (2)

We use the density, i.e., the average number of points falling into a cell per day, to represent the contribution of a cell to the overall load of the server to which it is assigned. Because the distribution of data over space is not uniform, distributing cells equally among RAM servers may not guarantee load balancing. In addition, locality is an important factor that must be taken into account. Thus, we merge adjacent, low-density cells together to form a new unit, called a zone. A zone can be a single cell or a group of adjacent cells that satisfies two conditions. First, the sum of the densities of all cells inside the zone must be lower than a threshold B. Second, if there is another zone located next to this zone and the sum of their densities is lower than B, they must be merged together. The former condition ensures that the cells cannot all be joined into a single zone; the latter helps the zone density approach the threshold B. Fig. 2 gives an example of the merging process.

Fig. 2. Merging separate cells in (a) into zones in (b) with B = 10. The gap between cell densities is 5, but that between zones is only 2.

By combining low-density cells, the density gap between zones at the center and those on the periphery is greatly reduced. We distribute zones, instead of cells, to servers. However, cells are still useful for high-speed data lookup. We construct two additional tables: one for cell-to-zone translation and the other for zone-to-server translation. Thus, for any point or rectangle defined by a pair of spatial coordinates, we can quickly determine the corresponding cells by applying (1) and (2). Those cells are then looked up in the cell-to-zone table to find the zones they belong to, and those zones are used to find the corresponding servers via the zone-to-server table. Because only table lookups and simple computations are performed, the complexity of the search process is O(1), which is extremely fast. The spatial distribution of the dataset depends strictly on the transportation infrastructure, which changes infrequently, so we compute the cell-to-zone translation table in advance and update it weekly or monthly.

IV. IMPLEMENTATION

S4STRD consists of two parts: a RAM-based cluster and a NoSQL database cluster. In a practical deployment, a server may be part of both clusters. For the database cluster, we reuse an existing implementation. In this section, we focus only on the implementation of our proposed RAM-based storage cluster.
A. Data standardization

As data comes to the system in various formats and sizes, the core component of each signal is transformed into a format that is easier to manage before being written to memory. We call this process standardization. During standardization, we convert the original spatio-temporal core, which consists of four primary attributes (device ID, generated time, written time, and spatial coordinates) and has a variable size, into a fixed-size form. This transformation brings many benefits. First, free space from deleted records can be reused without worrying about fragmentation. Second, the maximum capacity and available space can be calculated exactly, so the cleanup process can operate properly. Third, if data is put into blocks or arrays, we can easily find a record's location given its index, without support structures such as hash tables or search trees.

B. Memory allocation

Fig. 3. Memory organization inside a RAM server.

In our applications, update and delete operations are not applicable to raw GPS data. Therefore, each record is written once and read many times. With this type of dataset, sequential writes and random reads should be used to enhance performance, especially when many requests arrive at the same time. We realize this idea by dividing the memory space into fixed-size blocks. Because the record size is fixed, the block size should be divisible by the record size. Each RAM server is responsible for storing points coming from several zones, and blocks are assigned to these zones to hold their data according to the memory allocation mechanism described in Fig. 3. At the beginning, a set of empty blocks is allocated and placed in a block pool. In each zone, only one block, called the active block, can be written to. The active block is filled with new data sequentially.
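A minimal sketch of the fixed-size record layout and the index-based lookup it enables inside a block (the field widths here are assumptions; the paper fixes only the four core attributes, not their binary sizes):

```python
import struct

# Hypothetical 36-byte layout: device code (uint32), generated time
# (uint64), written time (uint64), latitude and longitude (float64).
RECORD_FMT = "<IQQdd"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # same size for every record

def standardize(device_code, gen_ts, written_ts, lat, lon):
    """Pack one core record into its fixed-size binary form."""
    return struct.pack(RECORD_FMT, device_code, gen_ts, written_ts, lat, lon)

def record_at(block, index):
    """Fixed-size records make lookup by index pure arithmetic:
    the offset is index * RECORD_SIZE, so no hash table or search
    tree is needed to locate a record inside a block."""
    off = index * RECORD_SIZE
    return struct.unpack(RECORD_FMT, block[off:off + RECORD_SIZE])
```

A block is then just a byte array holding a whole number of such records, appended to sequentially while the block is active.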
While a record is being written to the active block, other write requests on the same block are locked until the operation finishes; read operations, however, are not blocked. The other blocks in a zone are already filled with data and are available for reading only. Each zone also has its own buffer, which stores the original versions of the signals written to the active block. When the active block is filled up, the buffer is cleaned to reserve space for new records by pushing all remaining data to the NoSQL database. After that, a new empty block from the block pool is allocated to the zone and becomes the new active block.

C. Indexing

Spatial coordinates, time, and device ID usually appear in query conditions. As spatial coordinates are used for data partitioning and records are sorted by written time inside data blocks, we need indexes on generated time and device ID to obtain the expected performance for queries on these fields. Since the device ID is converted to a device code, which is numeric, a B-Tree or one of its variants can be applied to this field. Generated time, in contrast, is not independent of written time, and we can exploit this to avoid constructing another index structure. Theoretically, it takes the signal an amount of time to be transferred from

the source device to the data center, so the written time wt should equal the generated time gt plus a propagation delay t:

wt = gt + t (3)

Equation (3) shows that if the dataset is sorted by generated time, it should hold the same order as when sorted by written time. In practice, the generated time can be later or earlier than the written time, but the difference is small and acceptable. Assume that an application wants to get records whose generated time gt satisfies the condition s ≤ gt ≤ e, where s and e are arbitrary timestamps and s ≤ e (s = e when the application wants records generated at a particular moment). Instead of searching over generated time, we can perform the lookup over written time after shifting e to the right by an amount T. The search condition is rewritten as s ≤ wt ≤ e + T. The search range is broadened to the right due to the presence of the propagation delay. We ignore the case wt < gt and consider such records invalid. The accuracy of the result after this transformation depends on the value of T; in our case, T is estimated from the average propagation delay of valid records. An optional step after searching over written time with the modified condition is validation. Because scanning the whole list of result records is time consuming, we validate only the first and last few records, where faulty records are most likely to occur.

D. Data manipulation operations

For PUT operations, signals created by various devices are gathered at several gateways of the data center before being sent to the storage system. Those machines act as data clients. Filtering and validation are performed at the client. After determining the server responsible for a new record, the client sends it directly to that RAM server. At the RAM server, the core component of the signal is extracted, standardized, and written to the RAM store. Simultaneously, the original version of the signal is put into the server's buffer.
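The time-shifted lookup above can be sketched as follows, assuming records are kept as (written time, generated time, payload) tuples in written-time order within a block (the tuple layout and the full final filtering step are illustrative; the paper validates only a few records at each end for speed):

```python
import bisect

def query_generated_range(records, s, e, T):
    """Answer s <= gt <= e by searching over written time instead.

    records: (written_time, generated_time, payload) tuples sorted by
    written_time; T: estimated propagation delay. Per equation (3),
    the search condition becomes s <= wt <= e + T."""
    wts = [r[0] for r in records]
    lo = bisect.bisect_left(wts, s)
    hi = bisect.bisect_right(wts, e + T)
    # Validation step: the paper checks only the first and last few
    # records of the slice; here we filter the whole slice for clarity.
    return [r for r in records[lo:hi] if s <= r[1] <= e]
```

Because the data inside a block is already ordered by written time, the two bisections cost O(log n) and no separate index on generated time is needed.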
After both tasks have finished, the server immediately sends a response informing the client that the request was performed successfully. Different from PUT, GET is a complex process, as it involves both the RAM servers and the NoSQL database. As described in Fig. 4, for each GET request received from the application, the API library module at the client determines whether the RAM servers or the NoSQL database should process the request, according to the written-time constraint. If the search range falls inside the RAM servers' scope, it is better to send the request to the RAM servers; otherwise, the request is forwarded to the NoSQL database. We discuss only the former case, as the latter is purely an implementation matter. A GET request is performed inside a spatial area specified by a rectangle constructed from a pair of spatial coordinates. Since the client stores copies of the translation tables, it can determine by itself the RAM servers to which the request must be sent. If the information in the translation tables is out of date, the client downloads a new version from the RAM router before looking up the RAM servers. After the corresponding servers are found, the request is sent to them simultaneously. At each server, the translation tables are used again to determine the related zones; only zones that overlap the rectangle specified by the request are considered. The servers process the request and respond to the client.

Fig. 4. GET process. Arrow (1) presents the request sent from the application to the API library module; (2) indicates the request forwarded from the API library to the storage system; (3) indicates the response; and (4) indicates the final result.

At the client, after responses from all related RAM servers have been received, the data inside these responses is extracted and combined to form the final result. Since the RAM router barely participates in the GET process, a bottleneck at the router is virtually impossible.
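The client-side routing step above can be sketched as follows (the table contents and grid parameters are illustrative stand-ins for the client's cached cell-to-zone and zone-to-server translation tables from Section III):

```python
import math

def route_get(rect, cell_to_zone, zone_to_server, x0, y0, l):
    """Find the set of RAM servers covering a rectangular GET query.

    rect: ((lat_min, lon_min), (lat_max, lon_max)). Cells covered by
    the rectangle are enumerated with equations (1)-(2), then mapped
    cell -> zone -> server through the cached translation tables."""
    (lat1, lon1), (lat2, lon2) = rect
    x1, x2 = math.floor((lat1 - x0) / l), math.floor((lat2 - x0) / l)
    y1, y2 = math.floor((lon1 - y0) / l), math.floor((lon2 - y0) / l)
    zones = {cell_to_zone[(x, y)]
             for x in range(x1, x2 + 1) for y in range(y1, y2 + 1)}
    # One request goes to each distinct server, in parallel; the
    # client merges the responses into the final result.
    return {zone_to_server[z] for z in zones}
```

Since only arithmetic and table lookups are involved, the routing decision itself stays O(1) per cell and never touches the RAM router on the common path.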
V. PERFORMANCE EVALUATION

The GPS dataset collected in Ho Chi Minh City in March 2015 was used in the experiments to evaluate the system's performance. A disk-based NoSQL database was chosen as the baseline for two reasons: (i) it is a typical NoSQL database and was used in our old system, and (ii) it is implemented with an in-memory cache, which is somewhat similar to our in-memory storage.

First, to evaluate PUT and GET performance, we deployed a standalone RAM server and had it work under a continuous stream of PUT and GET requests. Running the experiment on a single machine equipped with an Intel Xeon E and 32 GB of RAM, with 10 million records from the real dataset, we obtained the following results: PUT throughput was approximately 27 thousand records per second, and read latency was about 22 microseconds. This was better than the NoSQL database: with the same hardware configuration, its throughput was only 13 thousand records per second and its latency 27 microseconds.

Second, to evaluate the scalability of the system, we ran experiments on a small cluster of 8 nodes. Each node was equipped with one Intel Xeon E3-12xx and 8 GB of RAM. All machines ran CentOS 7 with a Linux 3.10 kernel, connected by one Gigabit Ethernet switch. For all tests, the NoSQL database cluster was deployed over the 8 nodes, with one node configured as the router and 3 other nodes used as configuration servers; we kept its default settings. We set up the PUT performance experiment as follows: initially, the storage was left empty, and we added new servers to the cluster one by one. Each time the number of servers changed, we removed the old records and had 4 clients continuously put 4 million new GPS records. Because

data was synchronized from RAM to disks to guarantee durability, under very heavy write workloads the server's buffer easily overflowed, causing performance degradation. To examine the impact of this phenomenon, we measured performance in two cases: PUT with and without synchronization. Fig. 5 shows how PUT performance changed as the system grew in size. The scalability of the system was relatively good in both cases. However, due to the poor write performance of the NoSQL database, synchronization degraded throughput by 2 to 10 thousand records per second.

Fig. 5. PUT throughput as the number of nodes changes.

Fig. 6. Latency of the RAM storage and the NoSQL database under different write workloads. Each client generated a continuous stream of PUT requests. We added another client to generate GET requests; the average latency per record observed by this client is plotted on the graph.

Fig. 6 graphs the read latency of our system and of the NoSQL database as the number of clients increased. In this experiment, the storage system was preloaded with 2 million records. Then several clients sent PUT requests to create the write workload. At the same time, we started another client and had it send GET requests. A GET request was a range query over a rectangular area generated randomly using a uniform distribution. The latency observed by this client showed that our system worked relatively well under heavy load: the latency of our RAM storage was always smaller than 50 microseconds and increased slowly as the number of clients grew. In contrast, the NoSQL database's read latency was quite high, approximately 6-7 times higher than that of our system. Since every request was sent to its router before being redirected to data nodes, the router could easily become a bottleneck as the load rises. This is visible in Fig.
6, where the line for the NoSQL database grows significantly as the number of clients increases.

VI. RELATED WORK

Memcached [13] and MICA [14] are among the first implementations to use RAM to improve performance by keeping frequently used items in memory. These stores are mainly used as caches for other applications. A key factor distinguishing our in-memory storage system from a cache is that in such caches the lifetime of data depends on how it is accessed, so applications whose behavior often changes gain little benefit from them. This is not a problem for S4STRD, since it tries to keep every incoming record in RAM and only removes old records (in terms of written time) when memory is exhausted. RAMCloud is a RAM-based key-value store that divides memory into segments to maximize memory utilization [14]. The objective of RAMCloud is to reduce both read and write latency, while our system focuses on write throughput and read latency. Similarly, H-Store [12] is also a distributed RAM-based database. Unlike our system, H-Store provides relational semantics to manage data; this data model does not allow H-Store to store wide-format datasets. Although some studies address memory utilization, for instance PLACE [15] and Park et al. [16], most research on storing spatio-temporal data focuses on disk-based databases. Recent studies primarily focus on optimizing index structures for existing data models. Fox et al. [9] proposed a method to encode temporal and spatial information into the record key of column-family-oriented stores, but they did not take the device ID into account. Akulakrishna et al. [17] introduced the NineCellGrid method to improve the performance of routing-related queries. Its primary idea is to duplicate the data of one area and distribute it to neighboring areas to achieve locality. However, this strategy requires too much extra space to be implemented in memory.
Hadoop-GIS [10], CloST [11] and RTIC-C [18] introduced a new approach using the Hadoop/MapReduce framework. These systems build new engines on top of Hadoop that break requests from applications into small MapReduce jobs and then schedule their execution. Such solutions work well for applications that process huge amounts of data but are inappropriate for real-time ones. HyperDex [19] proposes a hyperspace hashing method that maps data into many subspaces to enable searching over different data dimensions. However, this solution requires a lot of extra memory as the number of subspaces increases. For spatial data, performance overhead still exists in current index structures, since they do not exploit the stable spatial distribution of spatio-temporal data. The R-Tree [20] and its variants are the most popular, but the cost of rebalancing an R-Tree after insertion is high and can accumulate. The k-d tree [21] and quad-tree [22] are unbalanced trees. The grid file [23] pays an extra cost when overflow occurs.

Similarly, TrajStore [24] and PIST [25] also partition data into cells according to spatial coordinates and build a temporal index inside each cell to speed up the search process. However, they divide the space recursively, which requires a tree structure to manage the partition instead of simple tables.

VII. CONCLUSION

In this article, we have presented an in-memory spatio-temporal storage system for real-time applications. Our system exploits unique characteristics of spatio-temporal datasets and takes advantage of RAM technology to achieve very high write throughput and low read latency. Our future work focuses on the following problems. Synchronization between memory and secondary storage maintains durability but slows down the write process, especially when the system is under heavy load; one potential solution is to parallelize this process by spreading data to multiple nodes to improve overall throughput. Another issue is that updating the spatial distribution only periodically may allow the effects of distribution changes on performance to last for a long time before being recognized, so an efficient mechanism to monitor and detect such changes is necessary. Disk swapping should also be taken care of. Although we developed this system for spatio-temporal applications only, we believe that the idea presented in this paper is generally applicable to other dedicated storage systems designed for a particular dataset and/or a specific class of applications.

ACKNOWLEDGEMENT

This research is funded by Vietnam National University Ho Chi Minh City under grant number B

REFERENCES

[1] A. Cerpa, J. Elson, M. Hamilton and J. Zhao, "Habitat Monitoring: Application Driver for Wireless Communications Technology," in Proceedings of the First ACM SIGCOMM Workshop on Data Communications in Latin America and the Caribbean, San Jose, Costa Rica.
[2] H. T. Kung and D.
Vlah, "Efficient Location Tracking Using Sensor Networks," in Proceedings of the 2003 IEEE Wireless Communications and Networking Conference (WCNC), New Orleans.
[3] E. Necula, "Dynamic Traffic Flow Prediction Based on GPS Data," in IEEE 26th International Conference on Tools with Artificial Intelligence, Limassol.
[4] G. Ghiani, F. Guerriero, G. Laporte and R. Musmanno, "Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies," European Journal of Operational Research, vol. 151, no. 1, pp. 1-11.
[5] G. Mintsis, S. Basbas, P. Papaioannou, C. Taxiltaris and I. N. Tziavos, "Applications of GPS technology in the land transportation system," European Journal of Operational Research, vol. 152, no. 2.
[6] "," [Online]. Available: [Accessed ].
[7] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann and R. Stutsman, "The case for RAMCloud," Communications of the ACM, vol. 54, no. 7, July.
[8] "HBase," [Online]. Available: [Accessed ].
[9] A. Fox, C. Eichelberger, J. Hughes and S. Lyon, "Spatio-temporal Indexing in Non-relational Distributed Databases," in IEEE International Conference on Big Data, Santa Clara.
[10] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang and J. Saltz, "Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce," Proceedings of the VLDB Endowment, vol. 6, no. 11.
[11] H. Tan, W. Lou and L. M. Ni, "CloST: A Hadoop-based Storage System for Big Spatio-Temporal Data Analytics," in the 21st ACM International Conference on Information and Knowledge Management, New York.
[12] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg and D. J.
Abadi, "H-Store: A High-performance, Distributed Main Memory Transaction Processing System," in Proceedings of the VLDB Endowment.
[13] "memcached - a distributed memory object caching system," [Online]. Available: memcached.org. [Accessed June 2015].
[14] H. Lim, D. Han, D. G. Andersen and M. Kaminsky, "MICA: A Holistic Approach to Fast In-Memory Key-Value Storage," in 11th USENIX Symposium on Networked Systems Design and Implementation, Seattle, WA, USA.
[15] S. M. Rumble, A. Kejriwal and J. Ousterhout, "Log-structured Memory for DRAM-based Storage," in The 12th USENIX Conference on File and Storage Technologies (FAST '14), Santa Clara, CA, USA, February.
[16] M. F. Mokbel and W. G. Aref, "PLACE: A Scalable Location-aware Database Server for Spatio-temporal Data Streams," IEEE Data Engineering Bulletin, vol. 28(3), pp. 3-10.
[17] J. Park, B. Hong, K. An and J. Jung, "A Unified Index for Moving-Objects Databases," in Computational Science and Its Applications - ICCSA 2006, Glasgow.
[18] P. K. Akulakrishna, J. Lakshmi and S. K. Nandy, "Efficient Storage of Big-Data for Real-Time GPS Applications," in 2014 IEEE Fourth International Conference on Big Data and Cloud Computing, Sydney, NSW.
[19] J. Yu, F. Jiang and T. Zhu, "RTIC-C: A Big Data System for Massive Traffic Information Mining," in 2013 International Conference on Cloud Computing and Big Data, Fuzhou.
[20] R. Escriva, B. Wong and E. G. Sirer, "HyperDex: A Distributed, Searchable Key-Value Store," in ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, New York, NY, USA.
[21] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching," in SIGMOD '84 Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, New York, NY, USA.
[22] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, Sept.
[23] H.
Samet, "The Quadtree and Related Hierarchical Data Structures," ACM Computing Surveys (CSUR), vol. 16, no. 2, June.
[24] J. Nievergelt, H. Hinterberger and K. C. Sevcik, "The Grid File: An Adaptive, Symmetric Multikey File Structure," ACM Transactions on Database Systems (TODS), vol. 9, no. 1.
[25] P. Cudre-Mauroux, E. Wu and S. Madden, "TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets," in 26th International Conference on Data Engineering (ICDE 2010), California.
[26] V. Botea, D. Mallett, M. A. Nascimento and J. Sander, "PIST: An Efficient and Practical Indexing Technique for Historical Spatio-Temporal Point Data," Geoinformatica, vol. 12, no. 2, 2008.

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

CA485 Ray Walshe NoSQL

CA485 Ray Walshe NoSQL NoSQL BASE vs ACID Summary Traditional relational database management systems (RDBMS) do not scale because they adhere to ACID. A strong movement within cloud computing is to utilize non-traditional data

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

A Case for Merge Joins in Mediator Systems

A Case for Merge Joins in Mediator Systems A Case for Merge Joins in Mediator Systems Ramon Lawrence Kirk Hackert IDEA Lab, Department of Computer Science, University of Iowa Iowa City, IA, USA {ramon-lawrence, kirk-hackert}@uiowa.edu Abstract

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

TrajAnalytics: A software system for visual analysis of urban trajectory data

TrajAnalytics: A software system for visual analysis of urban trajectory data TrajAnalytics: A software system for visual analysis of urban trajectory data Ye Zhao Computer Science, Kent State University Xinyue Ye Geography, Kent State University Jing Yang Computer Science, University

More information

Efficient Entity Matching over Multiple Data Sources with MapReduce

Efficient Entity Matching over Multiple Data Sources with MapReduce Efficient Entity Matching over Multiple Data Sources with MapReduce Demetrio Gomes Mestre, Carlos Eduardo Pires Universidade Federal de Campina Grande, Brazil demetriogm@gmail.com, cesp@dsc.ufcg.edu.br

More information

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval New Oracle NoSQL Database APIs that Speed Insertion and Retrieval O R A C L E W H I T E P A P E R F E B R U A R Y 2 0 1 6 1 NEW ORACLE NoSQL DATABASE APIs that SPEED INSERTION AND RETRIEVAL Introduction

More information