S4STRD: A Scalable in Memory Storage System for Spatio-Temporal Real-time Data


Tran Vu Pham, Duc Hai Nguyen and Khue Doan
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Contact: ptvu@hcmut.edu.vn

Abstract: The popularity of applications that use spatio-temporal data in real-time has brought about the need for efficient storage systems. Different systems exist for storing such data: traditional relational database management systems, NoSQL databases, RAM-based systems, and Hadoop/MapReduce-based systems. However, due to the special characteristics of spatio-temporal data used in real-time applications, these systems do not match the applications' performance requirements well. This paper introduces a distributed RAM-based storage system that works in combination with a NoSQL database to provide better performance for real-time applications that use huge volumes of spatio-temporal data. Experimental results show that the proposed system outperforms disk-based NoSQL databases and scales well as the volume of data increases.

Keywords: spatio-temporal; real-time; big data; RAM storage

I. INTRODUCTION

In recent years, spatio-temporal data has become popular and useful. In wireless sensor network applications, such as tracking moving objects or monitoring the condition of constructions, temporal and spatial information is frequently embedded into the data packets sent from sensor nodes to the data center [1] [2]. In transportation, GPS signals are used in many applications in the field of Intelligent Transportation Systems (ITS), such as traffic flow prediction [3], real-time vehicle routing [4], and railway mapping [5]. Apparently, these applications need to employ real-time information to improve user experience as well as to detect problems in time. However, spatio-temporal datasets are huge, and data arrives at a high rate from various sources in real-time.
Hence, providing a good storage mechanism for spatio-temporal data to support such applications is a challenging issue. A storage system for real-time data arriving at a very fast rate usually needs to deal with two issues: (i) storing a huge volume of data efficiently in real-time, and (ii) getting (querying) the data out of the storage with very low latency. Traditional relational database management systems (RDBMS), such as Oracle and MySQL, are currently very popular. However, due to their complex mechanisms for storing, indexing, and querying data, the performance of the relational model is not well suited to very large spatio-temporal, real-time datasets. Emerging NoSQL databases [6], such as RAMCloud [7] and HBase [8], allow multiple data formats by using simple data models such as key-value or document. They also have advantages in performance and scalability, obtained by sacrificing some consistency constraints or applying redundancy. With respect to storing and processing spatio-temporal data in real-time, NoSQL databases are more applicable than traditional RDBMS. However, the format-free data models supported by NoSQL databases commonly have only one primary index. Therefore, when search queries involve multiple attributes, which is common with spatio-temporal datasets where time and geographical attributes such as latitude and longitude are always part of a query, simple key-value lookup may not be efficient. Composite keys may be used to improve the performance of NoSQL databases [9]. Hadoop/MapReduce has emerged as a solution for massive processing of large datasets. Many MapReduce-based systems, such as those described in [10] [11], have been developed for mining large spatio-temporal datasets. However, the Hadoop framework is not appropriate for real-time applications, as its primary objective is throughput, not latency.
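To illustrate the composite-key idea mentioned above, one simple scheme (purely illustrative; the encoding in [9] is more elaborate) is to concatenate a zero-padded timestamp with spatial-cell and device identifiers, so that lexicographic key order matches time order in a single-index key-value store:

```python
def composite_key(written_ts, cell_x, cell_y, device_code):
    """Build a composite row key for a key-value store.

    A zero-padded timestamp prefix keeps records clustered by time;
    the spatial cell and device code act as tie-breakers. All field
    widths here are assumptions for the sake of the example."""
    return f"{written_ts:013d}:{cell_x:04d}:{cell_y:04d}:{device_code:08d}"

# Lexicographic order of the keys follows write-time order.
k_early = composite_key(2, 9, 9, 99)
k_late = composite_key(10, 0, 0, 1)
```

Because all components are fixed-width, a range scan over a time interval becomes a contiguous scan over the primary key, which is the property such encodings exploit.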
In recent years, RAM has been widely used to keep long-term data in data-intensive applications due to its superior performance over disks or flash storage [7]. As a result, many in-memory stores have been developed, e.g., RAMCloud [7], H-Store [12], and memcached [13]. The emergence of these storage systems brings a new way of storing and processing large datasets for real-time applications. Motivated by this research, in this article we introduce a new scalable in-memory storage system for spatio-temporal, real-time data. The new storage system includes an indexing mechanism optimized for storing data in memory, offering very high write throughput and low read latency. In the rest of this paper, we illustrate some key characteristics of spatio-temporal datasets (Section II). We then describe the basic idea and our implementation (Sections III and IV). Experimental results are reported and analyzed to evaluate the system performance (Section V). Finally, we discuss studies related to our work and conclude (Sections VI and VII).

II. MOTIVATING APPLICATION

We are currently running a Traffic Information System that collects GPS signals generated by GPS-embedded devices on vehicles in Ho Chi Minh City to produce real-time traffic conditions for the city. As GPS signals are a typical type of spatio-temporal data, a storage system architecture that works for GPS data can also be generalized to other types of spatio-temporal data.

Fig. 1. Spatial distribution of GPS data in Ho Chi Minh City.

GPS data is often collected from different types of devices, such as smartphones and embedded GPS devices on buses, taxis, etc. Signals generated by different types of devices have different data formats, fields, and sizes; the largest record is about 100 bytes. Every day, the system receives approximately 10 million GPS records from about 4200 buses in the city. Fig. 1 plots the distribution of GPS signals that the system received on four different days in March 2015 over the city area. The distribution is imbalanced: data density at the city center is extremely high while that in suburban areas is very low, because the majority of vehicles are in urban areas. However, the spatial distribution looks almost the same every day, since the transportation infrastructure and user behavior rarely change. Therefore, we can safely assume that the spatial distribution of spatio-temporal data is stable. In summary, the GPS signal stream that the Traffic Information System currently receives and processes in real-time has the following characteristics: (i) heterogeneous in format and size, (ii) huge in volume, and (iii) stable in space and time distribution.

III. DESIGN PRINCIPLES

A. Data organization

To store the huge volume of incoming data and to allow applications to process the data in real-time (with strictly limited latency), main memory (commonly known as RAM) is used as temporary storage for the most recent data.
The advantage of this approach is that data placed in RAM can be accessed much faster than data placed on secondary devices. In addition, RAM allows data to be accessed randomly without degrading performance, so we can freely write data without worrying about locality issues. However, as RAM is limited in size and sensitive to power failures and crashes, secondary storage (commonly hard disks) is also used to store the data persistently. For each incoming record, two copies are generated: one is stored in RAM and the other on secondary storage. The contents of the two copies differ, however. The hard disks store the original version, which contains all the information of the incoming signal, while only several chosen data fields are stored in RAM. The reason for this strategy is that some fields of spatio-temporal data are accessed much more often than others, and keeping rarely accessed fields in RAM is wasteful, as RAM space is more expensive. The dataset stored on hard disks is designed for backup purposes only. As the most recent data is stored in RAM, all data manipulation operations are executed on the in-RAM dataset; this helps achieve the desired performance. Old data that is not frequently accessed is removed from RAM to make space for incoming records. A NoSQL database is used for storing data on secondary storage. Since the on-disk dataset is managed by an independent database, we explain our solution for the in-memory storage system only.

B. Data partitioning

A set of distributed servers is used to handle the huge volume of data. Due to the stable distribution of data over space, as discussed in Section II, geographical coordinates are used as the criteria for partitioning the data among servers. We divide the space into a grid of n × n uniform square cells.
Each cell has its own coordinates (x, y), where y is the number of cells in the same row on its left side and x is the number of cells in the same column below it. Hence, given a record's spatial coordinates (latitude, longitude), the length l of a cell's side, and the position of the origin (x₀, y₀), we can easily determine which cell the record falls in using the following formulas:

x = ⌊(latitude − x₀) / l⌋ (1)
y = ⌊(longitude − y₀) / l⌋ (2)

We use the density, i.e., the average number of points falling into a cell per day, to represent the contribution of a cell to the overall load of the server to which it is assigned. Because the distribution of data over space is not uniform, distributing cells equally among RAM servers may not guarantee load balancing. In addition, locality is an important factor that must be taken into account. Thus, we merge adjacent, low-density cells together to form a new unit, called a zone. A zone can be a single cell or a group of adjacent cells that satisfies two conditions. First, the sum of the densities of all cells inside the zone must be lower than a threshold B. Second, if there is another zone located next to this zone and the sum of their densities is lower than B, they must be merged together. The former condition ensures that the cells cannot all be joined into a single zone; the latter helps the zone density approach the threshold B. Fig. 2 gives an example of the merging process.

Fig. 2. Merging separate cells in (a) into zones in (b) with B = 10. The gap between cell densities is 5, but that between zones is only 2.

By combining low-density cells, the density gap between zones at the center and those on the periphery is greatly reduced. We distribute zones, instead of cells, to servers. However, cells are still useful for high-speed data lookup. We construct two additional tables: one for cell-to-zone translation and the other for zone-to-server translation. Thus, for any point or rectangle defined by a pair of spatial coordinates, we can quickly determine the corresponding cells by applying (1) and (2). Those cells are then looked up in the cell-to-zone table to find the zones they belong to, and those zones are used to find the corresponding servers via the zone-to-server table. Because only table lookups and simple computations are performed, the complexity of the search process is O(1), which is extremely fast. The spatial distribution of the dataset depends strictly on the transportation infrastructure, which changes infrequently, so we compute the cell-to-zone translation table in advance and update it weekly or monthly.

IV. IMPLEMENTATION

S4STRD consists of two parts: a RAM-based cluster and a NoSQL database cluster. In a practical deployment, a server may be part of both clusters. For the database cluster, we reuse an existing implementation. In this section, we focus only on the implementation of our proposed RAM-based storage cluster.
A. Data standardization

As data comes to the system in various formats and sizes, the core component of each signal is transformed into a format that is easier to manage before being written to memory. We call this process standardization. During standardization, we convert the original spatio-temporal core, which consists of four primary attributes (device ID, generated time, written time, and spatial coordinates) and has a variable size, into a fixed-size form. This transformation brings many benefits. First, free space from deleted records can be reused without worrying about fragmentation. Second, the maximum capacity and available space can be calculated exactly, so the cleanup process can operate properly. Third, if data is put into blocks or arrays, we can easily find a record's location given its index, without support structures such as hash tables or search trees.

B. Memory allocation

Fig. 3. Memory organization inside a RAM server.

In our applications, update and delete operations are not applicable to raw GPS data. Therefore, each record is written once and read many times. With this type of dataset, sequential writes and random reads should be used to enhance performance, especially when many requests arrive at the same time. We realize this idea by dividing the memory space into fixed-size blocks. Because the record size is fixed, the block size should be divisible by the record size. Each RAM server is responsible for storing points coming from several zones, and blocks are assigned to these zones to hold their data according to the memory allocation mechanism described in Fig. 3. At the beginning, a set of empty blocks is allocated and placed in a block pool. In each zone, only one block, called the active block, can be written to. The active block is filled with new data sequentially.
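A minimal sketch of the fixed-size record layout and the index-based lookup it enables inside a block (the field widths here are assumptions; the paper fixes only the four core attributes, not their binary sizes):

```python
import struct

# Hypothetical 36-byte layout: device code (uint32), generated time
# (uint64), written time (uint64), latitude and longitude (float64).
RECORD_FMT = "<IQQdd"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # same size for every record

def standardize(device_code, gen_ts, written_ts, lat, lon):
    """Pack one core record into its fixed-size binary form."""
    return struct.pack(RECORD_FMT, device_code, gen_ts, written_ts, lat, lon)

def record_at(block, index):
    """Fixed-size records make lookup by index pure arithmetic:
    the offset is index * RECORD_SIZE, so no hash table or search
    tree is needed to locate a record inside a block."""
    off = index * RECORD_SIZE
    return struct.unpack(RECORD_FMT, block[off:off + RECORD_SIZE])
```

A block is then just a byte array holding a whole number of such records, appended to sequentially while the block is active.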
While a record is being written to the active block, other write requests on the same block are locked until the operation finishes; read operations, however, are not blocked. The other blocks in a zone are already filled with data and are available for reading only. Each zone also has its own buffer, which stores the original versions of the signals written to the active block. When the active block is filled up, the buffer is cleaned to reserve space for new records by pushing all remaining data to the NoSQL database. After that, a new empty block from the block pool is allocated to the zone and becomes the new active block.

C. Indexing

Spatial coordinates, time, and device ID usually appear in query conditions. As spatial coordinates are used for data partitioning and records are sorted by written time inside data blocks, we need indexes on generated time and device ID to obtain the expected performance for queries on these fields. Since the device ID is converted to a device code, which is numeric, a B-Tree or one of its variants can be applied to this field. Generated time, in contrast, is not independent of written time, and we can exploit this to avoid constructing another index structure. Theoretically, it takes the signal an amount of time to be transferred from

the source device to the data center, so the written time wt should equal the generated time gt plus a propagation delay t:

wt = gt + t (3)

Equation (3) shows that if the dataset is sorted by generated time, it should hold the same order as when sorted by written time. In practice, the generated time can be later or earlier than the written time, but the difference is small and acceptable. Assume that an application wants to get records whose generated time gt satisfies the condition s ≤ gt ≤ e, where s and e are arbitrary timestamps and s ≤ e (s = e when the application wants records generated at a particular moment). Instead of searching over generated time, we can perform the lookup over written time after shifting e to the right by an amount T. The search condition is rewritten as s ≤ wt ≤ e + T. The search range is broadened to the right due to the presence of the propagation delay. We ignore the case wt < gt and consider such records invalid. The accuracy of the result after this transformation depends on the value of T; in our case, T is estimated from the average propagation delay of valid records. An optional step after searching over written time with the modified condition is validation. Because scanning the whole list of result records is time consuming, we validate only the first and last few records, where faulty records are most likely to occur.

D. Data manipulation operations

For PUT operations, signals created by various devices are gathered at several gateways of the data center before being sent to the storage system. Those machines act as data clients. Filtering and validation are performed at the client. After determining the server responsible for a new record, the client sends it directly to that RAM server. At the RAM server, the core component of the signal is extracted, standardized, and written to the RAM store. Simultaneously, the original version of the signal is put into the server's buffer.
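The time-shifted lookup above can be sketched as follows, assuming records are kept as (written time, generated time, payload) tuples in written-time order within a block (the tuple layout and the full final filtering step are illustrative; the paper validates only a few records at each end for speed):

```python
import bisect

def query_generated_range(records, s, e, T):
    """Answer s <= gt <= e by searching over written time instead.

    records: (written_time, generated_time, payload) tuples sorted by
    written_time; T: estimated propagation delay. Per equation (3),
    the search condition becomes s <= wt <= e + T."""
    wts = [r[0] for r in records]
    lo = bisect.bisect_left(wts, s)
    hi = bisect.bisect_right(wts, e + T)
    # Validation step: the paper checks only the first and last few
    # records of the slice; here we filter the whole slice for clarity.
    return [r for r in records[lo:hi] if s <= r[1] <= e]
```

Because the data inside a block is already ordered by written time, the two bisections cost O(log n) and no separate index on generated time is needed.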
After both tasks have finished, the server immediately sends a response informing the client that the request was performed successfully. Different from PUT, GET is a complex process, as it involves both the RAM servers and the NoSQL database. As described in Fig. 4, for each GET request received from the application, the API library module at the client determines whether the RAM servers or the NoSQL database should process the request, according to the written-time constraint. If the search range falls inside the RAM servers' scope, it is better to send the request to the RAM servers; otherwise, the request is forwarded to the NoSQL database. We discuss only the former case, as the latter is purely an implementation matter. A GET request is performed inside a spatial area specified by a rectangle constructed from a pair of spatial coordinates. Since the client stores copies of the translation tables, it can determine by itself the RAM servers to which the request must be sent. If the information in the translation tables is out of date, the client downloads a new version from the RAM router before looking up the RAM servers. After the corresponding servers are found, the request is sent to them simultaneously. At each server, the translation tables are used again to determine the related zones; only zones that overlap the rectangle specified by the request are considered. The servers process the request and respond to the client.

Fig. 4. GET process. Arrow (1) presents the request sent from the application to the API library module; (2) indicates the request forwarded from the API library to the storage system; (3) indicates the response; and (4) indicates the final result.

At the client, after responses from all related RAM servers have been received, the data inside these responses is extracted and combined to form the final result. Since the RAM router barely participates in the GET process, a bottleneck at the router is virtually impossible.
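The client-side routing step above can be sketched as follows (the table contents and grid parameters are illustrative stand-ins for the client's cached cell-to-zone and zone-to-server translation tables from Section III):

```python
import math

def route_get(rect, cell_to_zone, zone_to_server, x0, y0, l):
    """Find the set of RAM servers covering a rectangular GET query.

    rect: ((lat_min, lon_min), (lat_max, lon_max)). Cells covered by
    the rectangle are enumerated with equations (1)-(2), then mapped
    cell -> zone -> server through the cached translation tables."""
    (lat1, lon1), (lat2, lon2) = rect
    x1, x2 = math.floor((lat1 - x0) / l), math.floor((lat2 - x0) / l)
    y1, y2 = math.floor((lon1 - y0) / l), math.floor((lon2 - y0) / l)
    zones = {cell_to_zone[(x, y)]
             for x in range(x1, x2 + 1) for y in range(y1, y2 + 1)}
    # One request goes to each distinct server, in parallel; the
    # client merges the responses into the final result.
    return {zone_to_server[z] for z in zones}
```

Since only arithmetic and table lookups are involved, the routing decision itself stays O(1) per cell and never touches the RAM router on the common path.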
V. PERFORMANCE EVALUATION

The GPS dataset collected in Ho Chi Minh City in March 2015 was used in the experiments to evaluate the system's performance. A disk-based NoSQL database was chosen as the baseline for two reasons: (i) it is a typical NoSQL database and was used in our old system, and (ii) it is implemented with an in-memory cache, which is somewhat similar to our in-memory storage.

First, to evaluate PUT and GET performance, we deployed a standalone RAM server and had it work under a continuous stream of PUT and GET requests. Running the experiment on a single machine equipped with an Intel Xeon E and 32 GB of RAM, with 10 million records from the real dataset, we obtained the following results: PUT throughput was approximately 27 thousand records per second, and read latency was about 22 microseconds. This was better than the NoSQL database: with the same hardware configuration, its throughput was only 13 thousand records per second and its latency 27 microseconds.

Second, to evaluate the scalability of the system, we ran experiments on a small cluster of 8 nodes. Each node was equipped with one Intel Xeon E3-12xx and 8 GB of RAM. All machines ran CentOS 7 with a Linux 3.10 kernel, connected by one Gigabit Ethernet switch. For all tests, the NoSQL database cluster was deployed over the 8 nodes, with one node configured as the router and 3 other nodes used as configuration servers; we kept its default settings. We set up the PUT performance experiment as follows: initially, the storage was left empty, and we added new servers to the cluster one by one. Each time the number of servers changed, we removed the old records and had 4 clients continuously put 4 million new GPS records. Because

data was synchronized from RAM to disks to guarantee durability, under very heavy write workloads the server's buffer easily overflowed, causing performance degradation. To examine the impact of this phenomenon, we measured performance in two cases: PUT with and without synchronization. Fig. 5 shows how PUT performance changed as the system grew in size. The scalability of the system was relatively good in both cases. However, due to the poor write performance of the NoSQL database, synchronization degraded throughput by 2 to 10 thousand records per second.

Fig. 5. PUT throughput as the number of nodes changes.

Fig. 6. Latency of the RAM storage and the NoSQL database under different write workloads. Each client generated a continuous stream of PUT requests. We added another client to generate GET requests; the average latency per record observed by this client is plotted on the graph.

Fig. 6 graphs the read latency of our system and of the NoSQL database as the number of clients increased. In this experiment, the storage system was preloaded with 2 million records. Then several clients sent PUT requests to create the write workload. At the same time, we started another client and had it send GET requests. A GET request was a range query over a rectangular area generated randomly using a uniform distribution. The latency observed by this client showed that our system worked relatively well under heavy load: the latency of our RAM storage was always smaller than 50 microseconds and increased slowly as the number of clients grew. In contrast, the NoSQL database's read latency was quite high, approximately 6-7 times higher than that of our system. Since every request was sent to its router before being redirected to data nodes, the router could easily become a bottleneck as the load rises. This is visible in Fig.
6, where the line for the NoSQL database grows significantly as the number of clients increases.

VI. RELATED WORK

Memcached [13] and MICA [14] are among the first implementations to use RAM to improve performance by keeping frequently used items in memory. These stores are mainly used as caches for other applications. A key factor distinguishing our in-memory storage system from a cache is that in such caches the lifetime of data depends on how it is accessed, so applications whose behavior often changes gain little benefit from them. This is not a problem for S4STRD, since it tries to keep every incoming record in RAM and only removes old records (in terms of written time) when memory is exhausted. RAMCloud is a RAM-based key-value store that divides memory into segments to maximize memory utilization [14]. The objective of RAMCloud is to reduce both read and write latency, while our system focuses on write throughput and read latency. Similarly, H-Store [12] is also a distributed RAM-based database. Unlike our system, H-Store provides relational semantics to manage data; this data model does not allow H-Store to store wide-format datasets. Although some studies address memory utilization, for instance PLACE [15] and Park et al. [16], most research on storing spatio-temporal data focuses on disk-based databases. Recent studies primarily focus on optimizing index structures for existing data models. Fox et al. [9] proposed a method to encode temporal and spatial information into the record key of column-family-oriented stores, but they did not take the device ID into account. Akulakrishna et al. [17] introduced the NineCellGrid method to improve the performance of routing-related queries. Its primary idea is to duplicate the data of one area and distribute it to neighboring areas to achieve locality. However, this strategy requires too much extra space to be implemented in memory.
Hadoop-GIS [10], CloST [11] and RTIC-C [18] introduced a new approach using the Hadoop/MapReduce framework. These systems build new engines on top of Hadoop that break requests from applications into small MapReduce jobs and then schedule their execution. Such solutions work well for applications that process huge amounts of data but are inappropriate for real-time ones. HyperDex [19] proposes a hyperspace hashing method that maps data into many subspaces to enable searching over different data dimensions. However, this solution requires a lot of extra memory as the number of subspaces increases. For spatial data, performance overhead still exists in current index structures, since they do not exploit the stable spatial distribution of spatio-temporal data. The R-Tree [20] and its variants are the most popular, but the cost of rebalancing an R-Tree after insertion is high and can accumulate. The k-d tree [21] and quad-tree [22] are unbalanced trees. The grid file [23] pays an extra cost when overflow occurs.

Similarly, TrajStore [24] and PIST [25] also partition data into cells according to spatial coordinates and build a temporal index inside each cell to speed up the search process. However, they divide the space recursively, which requires a tree structure to manage the partition instead of simple tables.

VII. CONCLUSION

In this article, we have presented an in-memory spatio-temporal storage system for real-time applications. Our system exploits unique characteristics of spatio-temporal datasets and takes advantage of RAM technology to achieve very high write throughput and low read latency. Our future work focuses on the following problems. Synchronization between memory and secondary storage maintains durability but slows down the write process, especially when the system is under heavy load; one potential solution is to parallelize this process by spreading data to multiple nodes to improve overall throughput. Another issue is that updating the spatial distribution only periodically may allow the effects of distribution changes on performance to last for a long time before being recognized, so an efficient mechanism to monitor and detect such changes is necessary. Disk swapping should also be taken care of. Although we developed this system for spatio-temporal applications only, we believe that the idea presented in this paper is generally applicable to other dedicated storage systems designed for a particular dataset and/or a specific class of applications.

ACKNOWLEDGEMENT

This research is funded by Vietnam National University Ho Chi Minh City under grant number B

REFERENCES

[1] A. Cerpa, J. Elson, M. Hamilton and J. Zhao, "Habitat Monitoring: Application Driver for Wireless Communications Technology," in Proceedings of the First ACM SIGCOMM Workshop on Data Communications in Latin America and the Caribbean, San Jose, Costa Rica.
[2] H. T. Kung and D.
Vlah, "Efficient Location Tracking Using Sensor Networks," in Proceedings of the 2003 IEEE Wireless Communications and Networking Conference (WCNC), New Orleans.
[3] E. Necula, "Dynamic Traffic Flow Prediction Based on GPS Data," in IEEE 26th International Conference on Tools with Artificial Intelligence, Limassol.
[4] G. Ghiani, F. Guerriero, G. Laporte and R. Musmanno, "Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies," European Journal of Operational Research, vol. 151, no. 1, pp. 1-11.
[5] G. Mintsis, S. Basbas, P. Papaioannou, C. Taxiltaris and I. N. Tziavos, "Applications of GPS technology in the land transportation system," European Journal of Operational Research, vol. 152, no. 2.
[6] "," [Online]. Available: [Accessed ].
[7] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, D. Ongaro, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann and R. Stutsman, "The case for RAMCloud," Communications of the ACM, vol. 54, no. 7, July.
[8] "HBase," [Online]. Available: [Accessed ].
[9] A. Fox, C. Eichelberger, J. Hughes and S. Lyon, "Spatio-temporal Indexing in Non-relational Distributed Databases," in IEEE International Conference on Big Data, Santa Clara.
[10] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang and J. Saltz, "Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce," Proceedings of the VLDB Endowment, vol. 6, no. 11.
[11] H. Tan, W. Lou and L. M. Ni, "CloST: A Hadoop-based Storage System for Big Spatio-Temporal Data Analytics," in the 21st ACM International Conference on Information and Knowledge Management, New York.
[12] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg and D. J.
Abadi, "H-Store: A High-performance, Distributed Main Memory Transaction Processing System," in Proceedings of the VLDB Endowment.
[13] "memcached - a distributed memory object caching system," [Online]. Available: memcached.org. [Accessed June 2015].
[14] H. Lim, D. Han, D. G. Andersen and M. Kaminsky, "MICA: A Holistic Approach to Fast In-Memory Key-Value Storage," in 11th USENIX Symposium on Networked Systems Design and Implementation, Seattle, WA, USA.
[15] S. M. Rumble, A. Kejriwal and J. Ousterhout, "Log-structured Memory for DRAM-based Storage," in The 12th USENIX Conference on File and Storage Technologies (FAST '14), Santa Clara, CA, USA, February.
[16] M. F. Mokbel and W. G. Aref, "PLACE: A Scalable Location-aware Database Server for Spatio-temporal Data Streams," IEEE Data Engineering Bulletin, vol. 28(3), pp. 3-10.
[17] J. Park, B. Hong, K. An and J. Jung, "A Unified Index for Moving-Objects Databases," in Computational Science and Its Applications - ICCSA 2006, Glasgow.
[18] P. K. Akulakrishna, J. Lakshmi and S. K. Nandy, "Efficient Storage of Big-Data for Real-Time GPS Applications," in 2014 IEEE Fourth International Conference on Big Data and Cloud Computing, Sydney, NSW.
[19] J. Yu, F. Jiang and T. Zhu, "RTIC-C: A Big Data System for Massive Traffic Information Mining," in 2013 International Conference on Cloud Computing and Big Data, Fuzhou.
[20] R. Escriva, B. Wong and E. G. Sirer, "HyperDex: A Distributed, Searchable Key-Value Store," in ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, New York, NY, USA.
[21] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching," in SIGMOD '84 Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, New York, NY, USA.
[22] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, Sept.
[23] H.
Samet, "The Quadtree and Related Hierarchical Data Structures," ACM Computing Surveys (CSUR), vol. 16, no. 2, June.
[24] J. Nievergelt, H. Hinterberger and K. C. Sevcik, "The Grid File: An Adaptive, Symmetric Multikey File Structure," ACM Transactions on Database Systems (TODS), vol. 9, no. 1.
[25] P. Cudre-Mauroux, E. Wu and S. Madden, "TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets," in 26th International Conference on Data Engineering (ICDE 2010), California.
[26] V. Botea, D. Mallett, M. A. Nascimento and J. Sander, "PIST: An Efficient and Practical Indexing Technique for Historical Spatio-Temporal Point Data," Geoinformatica, vol. 12, no. 2, 2008.

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

CA485 Ray Walshe NoSQL

CA485 Ray Walshe NoSQL NoSQL BASE vs ACID Summary Traditional relational database management systems (RDBMS) do not scale because they adhere to ACID. A strong movement within cloud computing is to utilize non-traditional data

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

A Case for Merge Joins in Mediator Systems

A Case for Merge Joins in Mediator Systems A Case for Merge Joins in Mediator Systems Ramon Lawrence Kirk Hackert IDEA Lab, Department of Computer Science, University of Iowa Iowa City, IA, USA {ramon-lawrence, kirk-hackert}@uiowa.edu Abstract

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

TrajAnalytics: A software system for visual analysis of urban trajectory data

TrajAnalytics: A software system for visual analysis of urban trajectory data TrajAnalytics: A software system for visual analysis of urban trajectory data Ye Zhao Computer Science, Kent State University Xinyue Ye Geography, Kent State University Jing Yang Computer Science, University

More information

Efficient Entity Matching over Multiple Data Sources with MapReduce

Efficient Entity Matching over Multiple Data Sources with MapReduce Efficient Entity Matching over Multiple Data Sources with MapReduce Demetrio Gomes Mestre, Carlos Eduardo Pires Universidade Federal de Campina Grande, Brazil demetriogm@gmail.com, cesp@dsc.ufcg.edu.br

More information

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval New Oracle NoSQL Database APIs that Speed Insertion and Retrieval O R A C L E W H I T E P A P E R F E B R U A R Y 2 0 1 6 1 NEW ORACLE NoSQL DATABASE APIs that SPEED INSERTION AND RETRIEVAL Introduction

More information