G-Storm: GPU-enabled High-throughput Online Data Processing in Storm


2015 IEEE International Conference on Big Data (Big Data)

G-Storm: GPU-enabled High-throughput Online Data Processing in Storm

Zhenhua Chen, Jielong Xu, Jian Tang, Kevin Kwiat and Charles Kamhoua

(Zhenhua Chen, Jielong Xu and Jian Tang are with the Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, {zchen3, jxu21, jtang2}@syr.edu. Both Kevin Kwiat and Charles Kamhoua are with the US Air Force Research Lab (AFRL), Rome, NY. The paper has been Approved for Public Release; Distribution Unlimited: 88ABW, dated 2 Mar. 15. This research was supported in part by NSF grants # and #. The information reported here does not reflect the position or the policy of the federal government.)

Abstract—The Single Instruction Multiple Data (SIMD) architecture of Graphics Processing Units (GPUs) makes them perfect for parallel processing of big data. In this paper, we present the design, implementation and evaluation of G-Storm, a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing. G-Storm has the following desirable features: 1) G-Storm is designed to be a general data processing platform like Storm, which can handle various applications and data types. 2) G-Storm exposes GPUs to Storm applications while preserving Storm's easy-to-use programming model. 3) G-Storm achieves high-throughput and low-overhead data processing with GPUs. We implemented G-Storm based on Storm 0.9.2 and tested it using two different applications: continuous query and matrix multiplication. Extensive experimental results show that, compared to Storm, G-Storm achieves over 7x improvement on throughput for continuous query, while maintaining reasonable average tuple processing time. It also leads to a 2.3x throughput improvement for the matrix multiplication application.

I. INTRODUCTION

In the past decade, with the rapid rise of mobile and wearable devices and social media, just to name a few, the data being created are getting bigger and bigger at an ever-increasing pace. It is very challenging to extract valuable and meaningful information from huge-volume, fast and continuous data in a timely manner. MapReduce [5] and Hadoop [8] have emerged as the de facto programming model and system for data-intensive applications, respectively. However, MapReduce and Hadoop are not suitable for online stream data applications, because they were designed for offline batch processing of static data, in which all input data need to be stored on a distributed file system in advance. Recently, several general-purpose distributed computing systems, such as Storm [17] and S4 [16], have emerged as promising solutions for online processing of unbounded streams of continuous data at scale. Their architectures are quite different from those of MapReduce-based systems, and they employ message passing, fault tolerance and resource allocation mechanisms designed specifically for online stream data processing.

A Graphics Processing Unit (GPU) is a massively-parallel, multi-threaded and many-core processor whose peak computing power is considerably higher than that of a CPU. For example, a single NVIDIA GeForce GTX 680 GPU delivers a peak performance of 3.09 Tera Floating Point Operations Per Second (TFLOPS) [7], which is over 28 times faster than that of a 6-core Intel Core i7-980 XE CPU. The Single Instruction Multiple Data (SIMD) architecture of GPUs makes them perfect for parallel processing of big data. However, research on stream processing with GPUs is still in its infancy.
Research efforts have recently been made to utilize GPUs to accelerate MapReduce-based systems and applications, such as Mars [9] and MapCG [10]. However, these solutions cannot be applied to solve our problem, since the programming model and system architecture of an online stream data processing system are quite different from those of a MapReduce-based offline batch processing system. Moreover, scant research attention has been paid to leveraging GPUs to accelerate online parallel stream data processing. In this paper, we present the design, implementation and evaluation of G-Storm, a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing. A Storm application is modeled as a directed graph called a topology, which usually includes two types of components: spouts and bolts. A spout is a source of a data stream, while a bolt consumes tuples from spouts or other bolts and processes them in a way defined by user code. Spouts and bolts can be executed as many tasks in parallel on multiple executors (i.e., threads), workers (i.e., processes) and worker nodes (i.e., machines) in a cluster.

A GPU has quite a different programming model and architecture, which poses a few challenges for integrating it into Storm, which was originally designed to run only on CPUs. First, it is not trivial to expose GPUs to various Storm applications, which have no knowledge of the underlying hardware that actually carries the workload. Second, Storm is a complex event processing system that processes incoming events (tuples) one at a time, as they arrive at each component. Under heavy workload, a Storm cluster can easily become overloaded and processing can come to a halt. GPUs can be leveraged to offload workload, preventing overload and enabling high-throughput processing. Marrying GPUs with stream data processing is very challenging and requires a new computation framework: intuitively, it would be inefficient for the GPU to work on one tuple at a time, so it is essential to achieve high GPU utilization to amortize the data transfer overhead while keeping processing latency within a reasonable time. The design and implementation of G-Storm carefully addresses these challenges, which leads to the following desirable features: 1) G-Storm is designed to be a general (rather than application-specific) data processing platform like Storm, which can deal with various applications and data types. 2) G-Storm exposes GPUs to Storm applications while preserving Storm's easy-to-use programming model, handling parallelization, data transfer and resource allocation automatically via a runtime system without user involvement. 3) G-Storm achieves high-throughput and low-overhead data processing by buffering multiple data tuples and offloading them in a batch to GPUs for highly-parallel processing. A minimal Storm topology is sketched below to make this programming model concrete.
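The following is a minimal, self-contained sketch of the spout/bolt programming model described above, written against the Storm 0.9.x Java API; the component names and the trivial processing logic are illustrative, not from the paper:

```java
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExampleTopology {
    // Spout: the source of the data stream (here, an endless counter).
    public static class NumberSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private long i = 0;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) { collector = c; }
        public void nextTuple() { collector.emit(new Values(i++)); }
        public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("n")); }
    }

    // Bolt: consumes tuples and processes them with user code.
    public static class DoubleBolt extends BaseBasicBolt {
        public void execute(Tuple t, BasicOutputCollector c) { c.emit(new Values(2 * t.getLong(0))); }
        public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("2n")); }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder b = new TopologyBuilder();
        b.setSpout("nums", new NumberSpout(), 1);                          // 1 executor (thread)
        b.setBolt("double", new DoubleBolt(), 4).shuffleGrouping("nums");  // 4 parallel tasks
        Config conf = new Config();
        conf.setNumWorkers(2);                                             // worker processes
        new LocalCluster().submitTopology("example", conf, b.createTopology());
    }
}
```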

To the best of our knowledge, G-Storm is the first GPU-enabled online parallel stream data processing system built on Storm. We believe the proposed architecture and methods can be extended to other parallel stream data processing systems, such as Yahoo!'s S4 [16], Google's MillWheel [1] and Microsoft's TimeStream [15], which have similar programming models and system architectures. Due to the space limitation, we omit the background introduction. The necessary background information can be found in the following references: Storm [17], [21], GPU programming [4] and JCUDA [11].

II. DESIGN AND IMPLEMENTATION OF G-STORM

A. Overview

Fig. 1. The architecture of G-Storm

Because of the differences in programming models, a novel framework is needed in order to utilize GPUs in Storm. It is not trivial to simply store incoming tuples in buffers and copy them to and from the GPU. The number of data copy operations needs to be minimized for each kernel launch to reduce overhead, while the amount of data copied each time needs to be sufficiently large for high-throughput processing. Each CUDA thread needs to know the type and location of the input data and output space to operate on, and which tuple the data belong to. This requires knowledge of the precise location of each tuple in memory; we call this the index information of a tuple. G-Storm's index construction is both innovative and robust in that it can deal with any input data type and batching size, while using a minimal memory footprint.

The overall design of G-Storm (Fig. 1) consists of a central control component called the GPU-executor, which runs as a thread on the CPU and launches the user kernel to process incoming tuples on the GPU. There are three additional modules that enable parallel processing of multiple tuples on a GPU, whose functionalities are summarized in the following: 1) Batcher: it works along with the Index Builder and Buffer Manager (described next) to batch incoming tuples and extract output tuples. 2) Index Builder: it constructs indices for tuple arguments, i.e., their locations in memory. 3) Buffer Manager: it is responsible for allocating and initializing buffers to store tuples.

The user needs to provide a PTX file containing a kernel that specifies how to process an incoming tuple on a GPU. Then a kernel wrapper needs to be generated to facilitate kernel execution across multiple tuples. In the original Storm, the master node controlling a cluster runs a daemon called Nimbus, which is responsible for distributing user code around the cluster and assigning the executors of running topologies to workers and worker nodes. The Supervisor is a daemon that runs on each worker node and listens for any work assigned to it by Nimbus.
Next, we summarize the workflow of tuple processing in a GPU-executor at runtime: 1) When a tuple arrives at the GPU-executor, its Batcher adds the new tuple to a dedicated buffer. 2) Once the desired number of buffered tuples (i.e., the batching size) is reached, the Batcher allocates the necessary input and output spaces in the GPU (device) memory and copies the input data (from the tuples) from the host memory to the device memory to prepare for processing. 3) The Batcher calls the Index Builder to construct the index information for batching and extracting tuples (see Section II-C3). The index information is copied along with the input data to the device memory, so the kernel can extract the data and align threads to operate on them. 4) The GPU-executor launches the kernel wrapper, which is itself a kernel function for locating the input data and output memory space in the device memory. 5) The kernel wrapper launches the user kernel, utilizing multiple threads to process the batched input tuples in parallel on the GPU. 6) After user kernel execution is complete, the output data are stored in a pre-allocated output memory space, which is then copied back to the host all at once to minimize transfer overhead. 7) The Batcher uses the (already generated) index information to extract the individual output tuples. 8) The output tuples are emitted to the next stage in the topology, as is normally done. A simplified code sketch of this batching workflow follows.
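The sketch below condenses the eight steps into a single batching execute() method. It is an illustration, not G-Storm's actual implementation: the batcher, indexBuilder and bufferManager objects, their method names, and launchKernelWrapper are hypothetical stand-ins for the modules described above.

```java
// Illustrative batching execute(): buffer tuples until the batching size is
// reached, offload the whole batch to the GPU, then emit the extracted results.
public void execute(Tuple tuple) {
    batcher.add(tuple);                                   // step 1: buffer the tuple
    if (batcher.count() < maxTupleCount) {
        return;                                           // keep buffering
    }
    IndexInfo index = indexBuilder.build(batcher.tuples());          // step 3
    CUdeviceptr devBuf = bufferManager.copyToDevice(batcher, index); // step 2
    launchKernelWrapper(devBuf, index.totalThreads());    // steps 4-5: wrapper + user kernel
    float[] output = bufferManager.copyToHost(devBuf);    // step 6: one transfer back
    for (Values outTuple : batcher.extractOutputTuples(output, index)) { // step 7
        collector.emit(outTuple);                         // step 8: emit downstream
    }
    batcher.clear();
}
```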

B. G-Storm Programming Model

G-Storm's programming model is very similar to Storm's, in which a user builds a Storm application (usually written in Java) by defining a topology consisting of spouts and bolts, and implements the spout and bolt classes by specifying how to emit and process tuples. Additionally, G-Storm's model incorporates the GPU programming model by using JCUDA [11] as a bridge between Storm and the GPU. JCUDA is a set of libraries that gives Java programs easy access to a GPU. JCUDA has a runtime API and a driver API that map to the CUDA runtime and driver APIs, respectively.

In G-Storm, we introduce a specialized bolt, called a GPU-bolt, which offloads tuple processing to a GPU. A GPU-executor is a running instance of a GPU-bolt. Figure 2 shows the software stack of a GPU-executor.

Fig. 2. Software stack of a GPU-executor

To seamlessly integrate GPUs into Storm, we create a GPU bolt base class, called BaseGPUBolt, which inherits Storm's BaseRichBolt and implements its IRichBolt interface. BaseGPUBolt acts like a checklist and makes it convenient for a user to build his/her own bolts, just as it is done in Storm. Specifically, a user can create a GPU bolt class by inheriting BaseGPUBolt and implementing its abstract methods for his/her custom application. Note that BaseGPUBolt also includes some concrete methods, such as prepare and execute, which leverage JCUDA to perform basic operations (such as initialization and kernel launching) for a GPU-executor (see Section II-C1). In addition, for a GPU-executor, a kernel and a kernel wrapper need to be provided by the user to perform custom tuple processing for his/her application. A kernel wrapper runs on the GPU with the user kernel to facilitate the mapping of tuples to CUDA threads. The user kernel only needs to define how one thread should operate on incoming tuple(s), such as resizing an image or multiplying two matrices. Once tuples have been buffered by the Batcher and copied to the device memory, it is the job of the kernel wrapper to extract the input tuple arguments and the output memory space locations associated with each tuple from the index information, so that the user kernel can be run with the correct number of threads in parallel and each thread knows which data to operate on. Note that the kernel wrapper can be manually written for each application, or it can be automatically generated, because all index information is known on the host before the data are transferred to the GPU.

G-Storm also allows a user to configure some parameters of the Batcher (see Section II-C2) via its methods, which may have an impact on the performance of an application. For example, a user can specify the batching size (i.e., the maximum number of tuples to be buffered before being sent to a GPU) by calling the setMaxTupleCount method of the Batcher. These parameters should be pre-configured according to application needs.

C. GPU-executor

In this section, we describe the key modules of a GPU-executor at runtime and their implementations in detail.

1) Basic Operations: The BaseGPUBolt class includes two methods, prepare and execute, which perform basic operations for a GPU-executor. The prepare method is called only once, when the GPU-executor is initialized; it performs the following operations to prepare the GPU for user kernel execution: initialize the CUDA driver; create a new CUDA device and return a handle to the actual device; create a new CUDA context on the device and associate it with the GPU-executor; locate the user kernel file; and compile the kernel file into PTX format and load it. Moreover, G-Storm offers the user a chance to add his/her additional preparation operations in the userPrepare method for flexibility. A sketch of these prepare-time operations in JCUDA follows.
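For illustration, these prepare-time steps map naturally onto JCUDA driver API calls, roughly as below. This is a sketch under stated assumptions, not G-Storm's code: the PTX file name kernel.ptx and the entry function name userKernel are hypothetical.

```java
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import static jcuda.driver.JCudaDriver.*;

public class GpuExecutorPrepare {
    private CUcontext context;
    private CUfunction userKernel;

    // Roughly the prepare-time setup described above, via the JCUDA driver API.
    public void prepare() {
        cuInit(0);                               // initialize the CUDA driver
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);                  // handle to the actual device
        context = new CUcontext();
        cuCtxCreate(context, 0, device);         // context associated with this executor
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "kernel.ptx");      // load the user kernel, compiled to PTX
        userKernel = new CUfunction();
        cuModuleGetFunction(userKernel, module, "userKernel");
    }
}
```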
The original execute method in Storm processes a single tuple at a time. The execute method in the BaseGPUBolt class instead uses the Batcher to buffer multiple incoming tuples and offload them in a batch to the GPU. Once the input data are resident on the GPU, it launches the kernel wrapper and the user kernel for data processing. When kernel execution is complete, the output data are copied back to the host for further processing. Note that all these operations are performed by the GPU-executor automatically, without user involvement. In our implementation, we leverage the JCUDA library to conduct these operations. JCUDA provides methods for device and event management, such as allocating memory on a GPU and copying memory between the host and the GPU. More importantly, it contains bindings to CUDA that allow CUDA kernels to be loaded and launched from Java [11].

2) Batcher: The Batcher is a key module of G-Storm. It is started immediately when the GPU-executor is initialized. The Batcher works along with the Index Builder and Buffer Manager (described next) to batch incoming tuples and extract output tuples. Specifically, the Batcher performs the following operations: it calls the Buffer Manager to construct a tuple buffer of sufficient size in host memory to hold incoming tuples; as each tuple arrives, it again uses the Buffer Manager to add the tuple to the pre-allocated tuple buffer; once the number of buffered tuples reaches the batching size, it first calls the Index Builder to construct an index buffer (described next) containing the working parameters of each tuple; it allocates memory on the GPU to hold the tuple buffer; it calls the Buffer Manager to copy the buffered tuples from the host memory to the device memory; after data processing on the GPU is complete, it copies the output data from the GPU back to the host; and, using the index information constructed by the Index Builder, it extracts each individual output tuple from the output data. The device-side part of this sequence is sketched below.
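The Batcher's device-side operations correspond to JCUDA driver calls along the following lines. This is a hedged sketch, not G-Storm's code: the float-array tuple buffer, the warp-sized block dimension and the kernelWrapper function handle are assumptions.

```java
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import static jcuda.driver.JCudaDriver.*;

public class BatchOffload {
    private CUfunction kernelWrapper; // loaded at prepare time

    // Copy a batched tuple buffer (input tuples + index buffer + output space)
    // to the device in one transfer, run the kernel wrapper over all threads,
    // and copy the whole buffer back in one transfer.
    public void offload(float[] tupleBuffer, int totalThreads) {
        long bytes = (long) tupleBuffer.length * Sizeof.FLOAT;
        CUdeviceptr devBuf = new CUdeviceptr();
        cuMemAlloc(devBuf, bytes);                            // one allocation per batch
        cuMemcpyHtoD(devBuf, Pointer.to(tupleBuffer), bytes); // single host-to-device copy

        int blockSize = 32;                                   // one warp per thread group
        int gridSize = (totalThreads + blockSize - 1) / blockSize;
        Pointer kernelParams = Pointer.to(Pointer.to(devBuf));
        cuLaunchKernel(kernelWrapper,
                gridSize, 1, 1, blockSize, 1, 1,
                0, null, kernelParams, null);
        cuCtxSynchronize();                                   // wait for the user kernel

        cuMemcpyDtoH(Pointer.to(tupleBuffer), devBuf, bytes); // single device-to-host copy
        cuMemFree(devBuf);                                    // output now sits in tupleBuffer
    }
}
```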

We use a single tuple buffer for multiple tuples to reduce the number of data transfers over the slow PCIe bus. The front of the tuple buffer contains the input tuples. The next portion is the index buffer, containing the index information generated by the Index Builder (explained next). Following the index buffer is the output space, which is used to store the output data.

3) Index Builder: The Index Builder is responsible for constructing indices for tuple arguments, i.e., their locations in memory. We use a dedicated buffer, called the index buffer, to store this information. The size of the index buffer is calculated from the batching size and the number of arguments of each tuple. The design of the index is essential to the batching process, because the Index Builder needs to provide the kernel wrapper at runtime with the precise indices of the input tuples' arguments and the locations of the output tuples, for varying batching sizes. The structure of the index buffer is shown in Figure 3.

Fig. 3. Index buffer

The first portion of the index buffer stores the total number of threads to be used during user kernel execution. The next portion is the thread group index. A thread group is a flexibly sized group of threads, which can be adjusted for different applications: because the tuple format may differ across applications, in some cases one thread operates on one tuple, while in other cases (such as large matrices) one thread operates on one element of the tuple (e.g., two numbers in the matrices). Thread group sizes should be multiples of a GPU streaming processor's warp size, which is usually 32, and it may take multiple thread groups to map a tuple to threads. Within each thread group, we store the three values the kernel wrapper needs to extract the tuple information. First, the number of threads in use within the group: a thread group may not have all of its threads in use. For example, if the thread group size is 32 and a tuple has 40 elements, the second thread group only needs 8 threads for execution, not all 32; extra threads in the kernel can return immediately if there is no work for them to do. Second, since a tuple may span multiple thread groups, we need to know which thread group maps to which tuple; in the previous example, two thread groups map to a single tuple, and we store their group numbers as 1 and 2. Lastly, we store a pointer to where the argument pointers are stored for each thread group. In this space, pointers to the arguments of the tuple are stored contiguously for each thread group, so they can also be easily extracted in the kernel wrapper; these pointers point to the actual locations of the input and output arguments in memory. This design leads to a structure that is small in size while providing all the information the kernel wrapper needs to process tuples of any batching size. A sketch of this thread-group computation follows.
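As an illustration of how the per-group entries could be derived for one tuple, here is a small sketch; the int[3] entry layout and the buildIndex helper are hypothetical simplifications of the byte-level index buffer described above.

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadGroupIndex {
    static final int GROUP_SIZE = 32; // a multiple of the warp size

    // One int[3] per thread group: {active thread count, owning tuple id,
    // offset of that tuple's argument pointers in the index buffer}.
    static List<int[]> buildIndex(int tupleId, int elements, int argOffset) {
        List<int[]> entries = new ArrayList<int[]>();
        int groups = (elements + GROUP_SIZE - 1) / GROUP_SIZE;
        for (int g = 0; g < groups; g++) {
            int active = Math.min(GROUP_SIZE, elements - g * GROUP_SIZE);
            entries.add(new int[] { active, tupleId, argOffset });
        }
        return entries;
    }

    public static void main(String[] args) {
        // A 40-element tuple maps to two groups: 32 active threads, then 8.
        for (int[] e : buildIndex(0, 40, 0)) {
            System.out.println(e[0] + " threads for tuple " + e[1]);
        }
    }
}
```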
4) Buffer Manager: The Buffer Manager is responsible for allocating and initializing the buffers that store tuples. As described above, when the Batcher starts, the first thing it does is call the initBuffers method to allocate a single tuple buffer, whose size is calculated from the batching size and the tuple argument information known a priori. The tuple buffer has to be large enough to hold not only the input data but also the expected output data. As each tuple arrives, the Batcher asks the Buffer Manager for the appropriate buffer in which to store the tuple; the arguments of each tuple are stored directly in the single tuple buffer.

A parameter that is common to all tuples is stored in an additional buffer, so that memory space is not wasted unnecessarily. For example, if a tuple is a matrix and the application transforms it by adding a value x to each element, then x can be stored in the additional buffer. Note that in some cases we may need to store a common vector of values in the additional buffer. By storing such parameters separately, we do not need to replicate them for each tuple, which can save significant space. For the output data, we also need to reserve memory space in the tuple buffer. In addition to the input tuples and output space, the index buffer also needs to be inserted into the tuple buffer: the Buffer Manager first reserves the space for the index buffer, and later, when the indices have been calculated by the Index Builder, it is called to insert the index buffer and form the complete tuple buffer. The actual copying of data from the host to the GPU is also handled by the Buffer Manager.

III. PERFORMANCE EVALUATION

In this section, we first describe the experimental settings and then present the results and the corresponding analysis. We implemented the proposed G-Storm system based on the Storm 0.9.2 release (obtained from Storm's repository at the Apache Software Foundation [18]), JCUDA [11] and CUDA 6.5 [4], and installed it on top of Ubuntu. We performed real experiments with an NVIDIA Quadro K5200 GPU. We evaluated the performance of G-Storm in terms of throughput. We utilized Storm's default performance reporting system, the Storm UI, to collect useful performance statistics, such as the number of tuples emitted, acked and failed. The throughput is the number of tuples fully acknowledged within a one-minute interval. We compared G-Storm with the original Storm, which runs only on CPUs, by conducting extensive experiments using two typical online data processing applications (topologies), namely Continuous Query and Matrix Multiplication (stream version). In the following, we describe these applications and how they were run in detail, and then present and analyze the corresponding experimental results.

Fig. 4. Throughput on the continuous query topology: (a) batching size = 128; (b) batching size = 256; (c) batching size = 512

A. Continuous Query

A simple select query works by initializing access to a database table and looping over each row to check whether there is a hit [2]. To implement this application in Storm, we designed a topology consisting of a spout and two bolts. The spout continuously emits a query, which is randomly generated. The second component is the query bolt, which can be either a CPU bolt or a GPU bolt. The table is placed in memory, on the host for Storm or on the GPU for G-Storm. This bolt takes the query from the spout and iterates over the table to see if there is a matching entry. A hit on a matching record is then emitted to the last component in the topology, a printer bolt, which simply takes the matching record and writes it to a file. In our experiment, we randomly generated a database table with vehicle plates and their owners' information (e.g., names and SSNs), and we randomly generated queries to search the table for the owners of speeding vehicles (vehicle speed is randomly generated and included in every tuple).

We performed several experiments on both Storm and G-Storm using this application and present the corresponding results in Figure 4. The running time for each experiment is 10 minutes. In the first experiment, we set G-Storm's batching size to 128. From Figure 4(a), we can see that the first minute of execution shows relatively low throughput for both G-Storm and Storm. After the first minute, G-Storm is able to achieve a stable throughput. At the 10th minute, G-Storm achieves a throughput of 9,163 tuples/min, compared to Storm at 4,722 tuples/min, which represents a 1.94x improvement. Over the entire running time, G-Storm consistently outperforms Storm and achieves an average 1.9x improvement on throughput. In the second experiment, we increased the batching size to 256. At the 10th minute, G-Storm has a throughput of 17,750 tuples/min; from Figure 4(b), we can see that on average, G-Storm outperforms Storm by 3.7 times in terms of throughput. In the third experiment, shown in Figure 4(c), we further increased the batching size to 512 and observed that on average this leads to a 7x improvement on throughput over Storm. From these results, we observe that: 1) G-Storm consistently outperforms Storm in terms of throughput; 2) increasing the batching size leads to more significant improvement.

In G-Storm, multiple tuples are processed in a batch by a GPU, which leads to higher throughput but may add latency for individual tuples, since they have to wait a little before being offloaded to the GPU; moreover, the data transfers cause additional latency. To show the trade-off between batching size and latency, we again used the Storm UI to collect the processing times of individual tuples, and we present the average tuple processing time in Figure 5. As expected, a larger batching size leads to a longer average tuple processing time, due to the longer waiting time in the buffer on the host. However, we notice that the cost of increasing the batching size on individual tuple processing time is rather negligible compared to the significant improvement on throughput shown in Figure 4. Moreover, we can see that once G-Storm stabilizes, the average tuple processing time is generally below 0.9 s, which is acceptable for online data processing.

Fig. 5. Average tuple processing time (batching sizes 128, 256 and 512)
B. Matrix Multiplication

The second application is a typical numerical computing application, Matrix Multiplication. For this application, we first prepared input files containing matrices: we generated two files, each containing around 9K randomly generated 8×8 double-precision matrices. To feed the matrices into the topology, we used an open-source tool called LogStash [14] that can read data from a file and push them into a Redis queue. Two LogStash instances are used to read the two files into two Redis queue channels. In the topology, the first component is a ReadMatrix spout that continuously extracts data from the queue channels to form tuples; each tuple contains two matrices. The spout then emits the tuple to a bolt. As before, this bolt can be either a CPU bolt or a GPU bolt; it extracts the two input matrices and performs the multiplication operation on a CPU or a GPU. The last stage of this topology is a printer bolt, which writes the results out to a file.
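Since the user kernel only defines what one thread does, the per-thread work in this application amounts to computing one output element. The following is an illustrative Java rendering of that per-thread logic (in G-Storm itself it would live in the CUDA user kernel, with idx derived from the thread and block indices):

```java
public class MatMulThread {
    // One GPU thread's share of C = A * B for row-major n-by-n matrices:
    // thread idx computes the single output element C[row][col].
    static void computeElement(double[] a, double[] b, double[] c, int n, int idx) {
        int row = idx / n, col = idx % n;
        double sum = 0.0;
        for (int k = 0; k < n; k++) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}
```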

We set the batching size to 16. As shown in Figure 6, G-Storm performs consistently better than Storm. At 180 seconds, G-Storm achieves a throughput of 4,341 tuples/min, compared to Storm at 1,777 tuples/min, which represents a 2.4x improvement; on average, G-Storm offers a 2.3x improvement over Storm on throughput. This application demonstrates the robustness of G-Storm, in that it can work with different applications by adjusting the batching size. In such applications, the batching size can be a relatively small value, because G-Storm can map an element of each tuple to a CUDA thread, so that it can fully utilize the resources on a GPU even with a small number of tuples.

Fig. 6. Throughput of the matrix multiplication application (batching size = 16)

IV. RELATED WORK

Mars [9] was the first known GPU-accelerated MapReduce system; it includes new user-defined and system APIs and a runtime system (based on NVIDIA GPUs). It needs to spend extra time counting the size of intermediate results, due to the lack of dynamic memory allocation on GPUs at the time the paper was written. Another MapReduce framework, MapCG [10], was developed to improve the performance of Mars; it employs a light-weight memory allocator to enable efficient dynamic memory allocation, as well as a hash table for storing intermediate key-value pairs to avoid the shuffling overhead. StreamMR [6] is a similar system, which, however, was developed particularly for AMD GPUs. Efficient methods were introduced in [3], [12] to leverage the shared memory on a GPU for further speedup. The authors of [19] introduced the development of a MapReduce-based system on a GPU cluster and presented several optimization mechanisms. Other related research efforts include MATE-CG [13] and Pamar [20]. G-Storm was built on Storm for general online stream data processing, with the addition of GPU acceleration, which involves a quite different programming model and architecture.

V. CONCLUSIONS

In this paper, we presented the design, implementation and evaluation of G-Storm, a GPU-enabled parallel platform based on Storm that leverages the massively parallel computing power of GPUs for high-throughput online stream data processing. G-Storm was designed to be a general-purpose data processing platform like Storm, which can support various applications and data types. G-Storm seamlessly integrates GPUs into Storm while preserving its easy-to-use programming model. Moreover, G-Storm achieves high-throughput and low-overhead data processing with GPUs by buffering a reasonable number of tuples and sending them in a batch to a GPU. We implemented G-Storm based on Storm 0.9.2, with JCUDA, CUDA 6.5 and an NVIDIA Quadro K5200 GPU, and tested it with two different applications: continuous query and matrix multiplication. Extensive experimental results have shown that, compared to Storm, G-Storm achieves over 7x improvement on throughput for continuous query while maintaining reasonable average tuple processing time, and it leads to a 2.3x throughput improvement for the matrix multiplication application.

REFERENCES

[1] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom and S. Whittle, MillWheel: fault-tolerant stream processing at internet scale, Proceedings of the VLDB Endowment, 2013.
[2] P. Bakkum and K. Skadron, Accelerating SQL database operations on a GPU with CUDA, 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), 2010.
[3] L. Chen and G. Agrawal, Optimizing MapReduce for GPUs with effective shared memory usage, Proceedings of ACM HPDC 2012.
[4] CUDA C Programming Guide.
[5] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Proceedings of USENIX OSDI 2004.
[6] M. Elteir, H. Lin, W. Feng and T. Scogland, StreamMR: an optimized MapReduce framework for AMD GPUs, Proceedings of IEEE ICPADS 2011.
[7] GeForce Kepler GK110 basic specs leaked.
[8] Apache Hadoop.
[9] B. He, W. Fang, Q. Luo, N. K. Govindaraju and T. Wang, Mars: a MapReduce framework with graphics processors, Proceedings of ACM PACT 2008.
[10] C. Hong, D. Chen, W. Chen, W. Zheng and H. Lin, MapCG: writing parallel program portable between CPU and GPU, Proceedings of ACM PACT 2010.
[11] JCUDA: Java bindings for the CUDA runtime and driver API.
[12] F. Ji and X. Ma, Using shared memory to accelerate MapReduce on graphics processing units, Proceedings of IEEE IPDPS 2011.
[13] W. Jiang and G. Agrawal, MATE-CG: a MapReduce-like framework for accelerating data-intensive computations on heterogeneous clusters, Proceedings of IEEE IPDPS 2012.
[14] Logstash: open source log management.
[15] Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu and Z. Zhang, TimeStream: reliable stream computation in the cloud, Proceedings of EuroSys 2013.
[16] Apache S4.
[17] Apache Storm.
[18] Storm 0.9.2 on the Apache Software Foundation.
[19] J. A. Stuart and J. D. Owens, Multi-GPU MapReduce on GPU clusters, Proceedings of IEEE IPDPS 2011.
[20] Y. S. Tan, B.-S. Lee, B. He and R. H. Campbell, A Map-Reduce based framework for heterogeneous processing element cluster environments, Proceedings of IEEE/ACM CCGrid 2012.
[21] J. Xu, Z. Chen, J. Tang and S. Su, T-Storm: Traffic-Aware Online Scheduling in Storm, Proceedings of IEEE ICDCS 2015.


More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 6367(Print) ISSN 0976 6375(Online)

More information

A New HadoopBased Network Management System with Policy Approach

A New HadoopBased Network Management System with Policy Approach Computer Engineering and Applications Vol. 3, No. 3, September 2014 A New HadoopBased Network Management System with Policy Approach Department of Computer Engineering and IT, Shiraz University of Technology,

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI 2017 International Conference on Electronic, Control, Automation and Mechanical Engineering (ECAME 2017) ISBN: 978-1-60595-523-0 The Establishment of Large Data Mining Platform Based on Cloud Computing

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Accelerated Machine Learning Algorithms in Python

Accelerated Machine Learning Algorithms in Python Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals

More information

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b International Conference on Artificial Intelligence and Engineering Applications (AIEA 2016) Implementation of Parallel CASINO Algorithm Based on MapReduce Li Zhang a, Yijie Shi b State key laboratory

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform

Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform By Rudraneel Chakraborty A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information