G-Storm: GPU-enabled High-throughput Online Data Processing in Storm
2015 IEEE International Conference on Big Data (Big Data)

Zhenhua Chen, Jielong Xu, Jian Tang, Kevin Kwiat and Charles Kamhoua

Abstract — The Single Instruction Multiple Data (SIMD) architecture of Graphics Processing Units (GPUs) makes them perfect for parallel processing of big data. In this paper, we present the design, implementation and evaluation of G-Storm, a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing. G-Storm has the following desirable features: 1) G-Storm is designed to be a general data processing platform as Storm, which can handle various applications and data types. 2) G-Storm exposes GPUs to Storm applications while preserving its easy-to-use programming model. 3) G-Storm achieves high-throughput and low-overhead data processing with GPUs. We implemented G-Storm based on Storm 0.9.2 and tested it using two different applications: continuous query and matrix multiplication. Extensive experimental results show that compared to Storm, G-Storm achieves over 7x improvement on throughput for continuous query, while maintaining reasonable average tuple processing time. It also leads to a 2.3x throughput improvement for the matrix multiplication application.

I. INTRODUCTION

In the past decade, with the rapid rise of mobile and wearable devices and social media, just to name a few, the data being created are getting bigger and bigger at an ever-increasing pace. It is very challenging to extract valuable and meaningful information from huge-volume, fast and continuous data in a timely manner. MapReduce [5] and Hadoop [8] have emerged as the de facto programming model and system for data-intensive applications respectively. However, MapReduce and Hadoop are not suitable for online stream data applications because they were designed for offline batch processing of static data, in which all input data need to be stored on a distributed file system in advance.
Recently, several general-purpose distributed computing systems, such as Storm [17] and S4 [16], have emerged as promising solutions for online processing of unbounded streams of continuous data at scale. Their architectures are quite different from MapReduce-based systems and they employ different message passing, fault tolerance and resource allocation mechanisms that are designed specifically for online stream data processing.

A Graphics Processing Unit (GPU) is a massively-parallel, multi-threaded and many-core processor, whose peak computing power is considerably higher than that of a CPU. For example, a single NVIDIA GeForce GTX 680 GPU delivers a peak performance of 3.09 Tera Floating Point Operations Per Second (TFLOPS) [7], which is over 28 times faster than that of a 6-core Intel Core i7 980 XE CPU. The Single Instruction Multiple Data (SIMD) architecture of GPUs makes them perfect for parallel processing of big data. However, research on stream processing with GPUs is still in its infancy. Research efforts have recently been made to utilize GPUs to accelerate MapReduce-based systems and applications, such as Mars [9], MapCG [10], etc. However, these solutions cannot be applied to solve our problem, since the programming model and system architecture of an online stream data processing system are quite different from those of a MapReduce-based offline batch processing system.

Zhenhua Chen, Jielong Xu and Jian Tang are with the Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, {zchen3, jxu21, jtang2}@syr.edu. Kevin Kwiat and Charles Kamhoua are both with the US Air Force Research Lab (AFRL), Rome, NY. The paper has been Approved for Public Release; Distribution Unlimited: 88ABW, Dated 2 Mar. 15. This research was supported in part by NSF grants # and #. The information reported here does not reflect the position or the policy of the federal government.
Moreover, scant research attention has been paid to leveraging GPUs for accelerating online parallel stream data processing. In this paper, we present the design, implementation and evaluation of G-Storm, a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing. A Storm application is modeled as a directed graph called a topology, which usually includes two types of components: spouts and bolts. A spout is a source of a data stream, while a bolt consumes tuples from spouts or other bolts and processes them in a way defined by user code. Spouts and bolts can be executed as many tasks in parallel on multiple executors (i.e., threads), workers (i.e., processes) and worker nodes (i.e., machines) in a Storm cluster.

A GPU has quite a different programming model and architecture, which poses a few challenges for integrating it into Storm, which was originally designed to run only on CPUs. First, it is not trivial to expose GPUs to various Storm applications, which have no knowledge of the underlying hardware that actually carries the workload. Second, Storm is a complex event processing system which processes incoming events (tuples) one at a time as they arrive at each component. Under heavy workload, a Storm cluster can easily become overloaded and processing can come to a halt. GPUs can be leveraged to offload workload to prevent overloading and enable high-throughput processing. Marrying GPUs with stream data processing is very challenging and requires a new computation framework. Intuitively, it would be inefficient for the GPU to work on one tuple at a time. It is essential to achieve high GPU utilization to amortize data transfer overhead while keeping processing latency within a reasonable time.
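The amortization argument above can be made concrete with back-of-envelope arithmetic. The sketch below (plain Java; every constant is an illustrative assumption, not a measurement from this paper) compares effective throughput when a fixed offload overhead (PCIe transfer setup plus kernel launch) is paid per tuple versus amortized over a batch:

```java
// Illustrative cost model: each offload to the GPU pays a fixed overhead,
// plus a small per-tuple compute cost. Constants are assumptions for
// illustration only, not numbers from G-Storm's evaluation.
public class BatchAmortization {
    static final double FIXED_OVERHEAD_MS = 10.0; // assumed per-offload cost
    static final double PER_TUPLE_MS = 0.05;      // assumed per-tuple GPU cost

    // Tuples processed per second when offloading batchSize tuples at a time.
    static double tuplesPerSecond(int batchSize) {
        double batchTimeMs = FIXED_OVERHEAD_MS + PER_TUPLE_MS * batchSize;
        return batchSize / batchTimeMs * 1000.0;
    }

    public static void main(String[] args) {
        // One tuple at a time: the fixed overhead dominates.
        System.out.printf("batch=1:   %.0f tuples/s%n", tuplesPerSecond(1));
        // A batch of 512 amortizes the fixed cost across many tuples.
        System.out.printf("batch=512: %.0f tuples/s%n", tuplesPerSecond(512));
    }
}
```

Under these assumed numbers, batching 512 tuples improves effective throughput by two orders of magnitude over per-tuple offloading, at the cost of the queueing latency examined in Section III.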
The design and implementation of G-Storm carefully address these challenges, which leads to the following desirable features: 1) G-Storm is designed to be a general (rather than application-specific) data processing platform as Storm, which can deal with various applications and data types. 2) G-Storm exposes GPUs to Storm applications while preserving its easy-to-use programming model by handling
parallelization, data transfer and resource allocation automatically via a runtime system without user involvement. 3) G-Storm achieves high-throughput and low-overhead data processing by buffering multiple data tuples and offloading them in a batch to GPUs for highly-parallel processing. To the best of our knowledge, G-Storm is the first GPU-enabled online parallel stream data processing system that is built based on Storm. We believe the proposed architecture and methods can be extended to other parallel stream data processing systems, such as Yahoo!'s S4 [16], Google's MillWheel [1] and Microsoft's TimeStream [15], which have similar programming models and system architectures. Due to space limitations, we have to omit the background introduction. The necessary background information can be found in the following references: Storm [17], [21], GPU programming [4] and JCUDA [11].

II. DESIGN AND IMPLEMENTATION OF G-STORM

A. Overview

Fig. 1. The architecture of G-Storm

Because of differences in programming models, a novel framework is needed in order to utilize GPUs in Storm. It is not trivial to simply store incoming tuples in buffers and copy them to and from the GPU. The number of data copy operations needs to be minimized for each kernel launch to reduce overhead, while the amount of data copied each time needs to be sufficiently large for high-throughput processing. Each CUDA thread needs to know the type and location of the input data and output space to operate on, and which tuple the data belong to. This requires knowledge of the precise location of each tuple in memory. We call this the index information of a tuple. G-Storm's index construction is both innovative and robust in that it can deal with any input data type and batching size, while using a minimal memory footprint. The overall design of G-Storm (Fig. 1) consists of a central control component called the GPU-executor, which runs as a thread on the CPU and launches the user kernel to process incoming tuples on the GPU.
There are three additional modules that enable parallel processing of multiple tuples on a GPU, whose functionalities are summarized in the following: 1) Batcher: It works along with the Index Builder and Buffer Manager (described next) on the critical operations of batching incoming tuples and extracting output tuples. 2) Index Builder: It constructs indices for tuple arguments, which are their locations in memory. 3) Buffer Manager: It is responsible for allocating and initializing buffers to store tuples. The user needs to provide a PTX file containing a kernel that specifies how to process an incoming tuple on a GPU. Then a kernel wrapper needs to be generated to facilitate kernel execution across multiple tuples.

In the original Storm, the master node controlling a cluster runs a daemon called Nimbus, which is responsible for distributing user code around the cluster and assigning executors of running topologies to workers and worker nodes. Supervisor is a daemon that runs on each worker node and listens for any work assigned to it by Nimbus.

Next, we summarize the workflow of tuple processing in a GPU-executor at runtime: 1) When a tuple arrives at the GPU-executor, its Batcher adds the new tuple to a dedicated buffer. 2) Once the desired number of tuples to be buffered (i.e., the batching size) is reached, the Batcher allocates the necessary input and output spaces in the GPU (device) memory and copies the input data (from tuples) from the host memory to the device memory to prepare for processing. 3) The Batcher calls the Index Builder to construct the index information for batching and extracting tuples (see Section II-C3). The index information is copied along with the input data to the device memory so the kernel can extract the data and align threads to operate on the data. 4) The GPU-executor launches the kernel wrapper, which is itself a kernel function, for locating input data and output memory space in the device memory.
5) The kernel wrapper launches the user kernel, utilizing multiple threads to process the batched input tuples in parallel on the GPU. 6) After user kernel execution is complete, the output data are stored in a pre-allocated output memory space, which is then copied back to the host all at once to minimize transfer overhead. 7) The Batcher uses the (already generated) index information to extract the individual output tuples. 8) The output tuples are emitted to the next stage in the topology as is normally done.

B. G-Storm Programming Model

G-Storm's programming model is very similar to Storm's, where a user builds a Storm application (usually written in Java) by defining a topology consisting of spouts and bolts, and implements the spout and bolt classes by specifying how to emit and process tuples. Additionally, G-Storm's model incorporates the GPU programming model as well, using JCUDA [11] as a bridge from Storm to the GPU. JCUDA is a set of libraries that give Java programs easy
access to a GPU. JCUDA has a runtime API and a driver API that map to the CUDA runtime and driver APIs respectively. In G-Storm, we introduce a specialized bolt, called a GPU-bolt, which offloads tuple processing to a GPU. A GPU-executor is a running instance of a GPU-bolt. Figure 2 shows the software stack of a GPU-executor.

Fig. 2. Software stack of a GPU-executor

To seamlessly integrate GPUs into Storm, we create a GPU bolt base class, called BaseGPUBolt, which inherits Storm's BaseRichBolt and implements its IRichBolt interface. BaseGPUBolt acts like a checklist and makes it convenient for a user to build his/her own bolts, as has to be done in Storm. Specifically, a user can create a GPU bolt class by inheriting BaseGPUBolt and implementing its abstract methods for his/her custom application. Note that BaseGPUBolt also includes some concrete methods, such as prepare and execute, which leverage JCUDA for performing basic operations (such as initialization and kernel launching) for a GPU-executor (see Section II-C1). In addition, for a GPU-executor, a kernel and a kernel wrapper need to be provided by the user to perform custom tuple processing for his/her application. A kernel wrapper runs on the GPU with the user kernel to facilitate the mapping of tuples to CUDA threads. The user kernel only needs to define how one thread should operate on incoming tuple(s), such as resizing an image or multiplying two matrices. Once tuples have been buffered by the Batcher and copied to the device memory, it is the job of the kernel wrapper to extract the input tuple arguments and the output memory space locations associated with each tuple from the index information, so that the user kernel can be run with the correct number of threads in parallel and each thread knows which data to operate on. Note that the kernel wrapper can be manually written for each application, or it can be automatically generated, because all index information is known on the host before data are transferred to the GPU.
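To make the checklist role of BaseGPUBolt concrete, a user bolt might look as follows. This is a minimal stand-in sketch: the base class, its abstract methods and all names here are simplified assumptions inferred from the description above, not G-Storm's actual API, and the real base class would additionally drive JCUDA and the Batcher.

```java
// Stand-in for the base class described in the text. In G-Storm, prepare
// would initialize CUDA via JCUDA and start the Batcher; here it only
// invokes the user hook so the subclassing pattern can be shown.
abstract class BaseGPUBolt {
    public void prepare() { userPrepare(); }
    public abstract void userPrepare();     // user's extra setup (hook)
    public abstract String kernelFile();    // path to the user's PTX kernel
}

// A hypothetical application bolt: the user names the kernel and adds
// setup, mirroring how BaseRichBolt is extended in plain Storm.
class MatrixMultiplyBolt extends BaseGPUBolt {
    boolean prepared = false;
    @Override public void userPrepare() { prepared = true; }
    @Override public String kernelFile() { return "matmul.ptx"; }
}
```

The point of the pattern is that parallelization, batching and data transfer live in the concrete base-class methods, so the subclass stays as small as an ordinary Storm bolt.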
G-Storm also allows a user to configure some parameters (such as the batching size) of the Batcher (see Section II-C2) via its methods, which may have an impact on the performance of an application. For example, a user can specify the batching size (i.e., the maximum number of tuples to be buffered before being sent to a GPU) by calling the setMaxTupleCount method of the Batcher. These parameters should be pre-configured according to application needs.

C. GPU-executor

In this section, we describe the key modules of a GPU-executor at runtime and their implementations in detail.

1) Basic Operations: The BaseGPUBolt class includes two methods, prepare and execute, which perform basic operations for a GPU-executor. The prepare method is called only once, when the GPU-executor is initialized, and performs the following necessary operations to prepare the GPU for user kernel execution:
- Initialize the CUDA driver.
- Create a new CUDA device and return a handle to the actual device.
- Create a new CUDA context on the device and associate it with the GPU-executor.
- Locate the user kernel file.
- Compile the kernel file into PTX format and load it.
Moreover, G-Storm offers the user a chance to add his/her additional preparation operations in the userPrepare method for flexibility. The original execute method in Storm processes a single tuple at a time. The execute method in the BaseGPUBolt class instead uses the Batcher to buffer multiple incoming tuples and offload them in a batch to the GPU. When the input data are resident on the GPU, it launches the kernel wrapper and user kernel for data processing. When kernel execution is complete, the output data are copied back to the host for further processing. Note that all these operations are performed by the GPU-executor automatically without user involvement. In our implementation, we leverage the JCUDA library for conducting these operations.
JCUDA provides methods for device and event management, such as allocating memory on a GPU and copying memory between the host and the GPU. More importantly, it contains bindings to CUDA which allow for loading and launching CUDA kernels from Java [11].

2) Batcher: The Batcher is a key module of G-Storm. It is started immediately when the GPU-executor is initialized. The Batcher works along with the Index Builder and Buffer Manager (described next) on the critical operations of batching incoming tuples and extracting output tuples. Specifically, the Batcher performs the following operations:
- Calls the Buffer Manager to construct a tuple buffer of sufficient size in host memory to hold incoming tuples.
- As each tuple arrives, again uses the Buffer Manager to add the tuple to the pre-allocated tuple buffer.
- If the number of buffered tuples reaches the batching size, first calls the Index Builder to construct an index buffer (described next) containing the working parameters of each tuple.
- Allocates memory on the GPU to hold the tuple buffer.
- Calls the Buffer Manager to copy the buffered tuples from the host memory to the device memory.
- After data processing on the GPU is complete, copies the output data from the GPU back to the host.
- Using the index information constructed by the Index Builder, extracts each individual output tuple from the output data.
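The buffering policy in the steps above reduces to a simple rule: accumulate tuples until the batching size is reached, then release them as one batch. The class below is an illustrative reduction of that logic in plain Java; the real Batcher would perform a host-to-device copy and involve the Index Builder and Buffer Manager at the flush point.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the Batcher's buffering policy only; GPU copies
// and index construction are elided. Not code from G-Storm itself.
class BatcherSketch<T> {
    private final int batchingSize;               // the configured batch size
    private final List<T> buffer = new ArrayList<>();
    private int batchesFlushed = 0;

    BatcherSketch(int batchingSize) { this.batchingSize = batchingSize; }

    // Returns the completed batch when full, or null while still filling.
    List<T> add(T tuple) {
        buffer.add(tuple);
        if (buffer.size() < batchingSize) return null;
        List<T> batch = new ArrayList<>(buffer);  // here: host->device copy
        buffer.clear();
        batchesFlushed++;
        return batch;
    }

    int batchesFlushed() { return batchesFlushed; }
}
```

A larger batching size means fewer flushes (and fewer PCIe transfers) per tuple, which is exactly the throughput/latency trade-off evaluated in Section III.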
We use a single tuple buffer for multiple tuples to reduce the number of data transfers over the slow PCIe bus. The front of the tuple buffer contains the input tuples. The next portion is the index buffer containing the index information that is generated by the Index Builder (explained next). Following the index buffer is the output space, which is used to store output data.

3) Index Builder: The Index Builder is responsible for constructing indices for tuple arguments, which are their locations in memory. We use a dedicated buffer called the index buffer to store this information. The size of the index buffer is calculated based on the batching size and the number of arguments of each tuple. The design of the index is essential to the batching process because the Index Builder needs to provide the kernel wrapper at runtime with the precise indices for the arguments of input tuples, and the locations of output tuples, for varying batching sizes. The structure of the index buffer is shown in detail in Figure 3.

Fig. 3. Index buffer

The first portion of the index buffer stores the total number of threads to be used during user kernel execution. The next portion is the thread group index. A thread group is a flexibly sized group of threads, which can be adjusted for different applications. Because the tuple format in different applications may differ, in some cases one thread may operate on one tuple, while in other cases (such as large matrices), one thread may operate on one element (e.g., two numbers in matrices) of the tuple. Thread group sizes should be multiples of a GPU streaming multiprocessor's warp size, which is usually 32. It may take multiple thread groups to map a tuple to threads. Within each thread group, we store three values needed by the kernel wrapper to extract the tuple information. First, the number of threads to be used within each group, since a thread group may not have all of its threads in use.
For example, if the thread group size is 32 and a tuple has 40 elements, the second thread group only needs to use 8 threads for execution, not all 32. Extra threads in the kernel can return immediately if there is no work for them to do. Second, since a tuple may span multiple thread groups, we need to know which thread group maps to which tuple. In the previous example, there are two thread groups that map to a single tuple, and we store their group numbers as 1 and 2. Lastly, we store a pointer to where the argument pointers are stored for each thread group. In this space, pointers to the arguments of the tuple are stored contiguously for each thread group so they can also be easily extracted in the kernel wrapper. These pointers point to the actual locations where input arguments and output arguments reside in memory. This design leads to a structure that is small in size, while providing all the necessary information to the kernel wrapper for processing tuples of any batching size.

4) Buffer Manager: The Buffer Manager is responsible for allocating and initializing buffers to store tuples. As described above, when the Batcher starts, the first thing it does is call the initBuffers method to allocate a single tuple buffer of sufficient size, calculated based on the batching size and the tuple argument information known a priori. The tuple buffer has to be large enough to hold not only the input data, but also the expected output data. As each tuple arrives, the Batcher asks the Buffer Manager for the appropriate buffer to store the tuple. The arguments of each tuple are stored directly in the single tuple buffer. A parameter that may be common to all tuples is stored in an additional buffer so that memory space is not wasted unnecessarily. For example, if a tuple is a matrix, and the application is to transform it by adding a value x to each element in the matrix, then x can be stored in the additional buffer.
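The thread-group accounting used by the Index Builder (group size 32, with the last group of a tuple possibly only partially active, e.g. 8 of 32 threads) can be written out directly. The helper below is an illustrative reconstruction of that arithmetic, not code from G-Storm:

```java
// Illustrative reconstruction of the Index Builder's thread-group math:
// a tuple with `elements` work items is split across groups of
// `groupSize` threads; the last group may be only partially used, and
// the unused threads simply return immediately in the kernel.
class ThreadGroups {
    static int[] activeThreadsPerGroup(int elements, int groupSize) {
        int groups = (elements + groupSize - 1) / groupSize; // ceil division
        int[] active = new int[groups];
        for (int g = 0; g < groups; g++) {
            int remaining = elements - g * groupSize;
            active[g] = Math.min(groupSize, remaining);
        }
        return active;
    }
}
```

For the running example, activeThreadsPerGroup(40, 32) yields two groups using 32 and 8 threads respectively, which is exactly the pair of per-group counts the index buffer records.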
Note that in some cases, we may need to store a common vector of values in the additional buffer. By storing such parameters separately, we do not need to replicate them for each tuple, which can save significant space. For output data, we also need to reserve memory space in the tuple buffer. In addition to the input tuples and output space, the index buffer also needs to be inserted into the tuple buffer. The Buffer Manager first reserves the space for the index buffer, and later, when the indices have been calculated by the Index Builder, it is called to insert the index buffer to form the complete tuple buffer. The actual copying of data from the host to the GPU is also handled by the Buffer Manager.

III. PERFORMANCE EVALUATION

In this section, we first describe the experimental settings, and then present the results and the corresponding analysis. We implemented the proposed G-Storm system based on the Storm 0.9.2 release (obtained from Storm's repository at the Apache Software Foundation [18]), JCUDA [11] and CUDA 6.5 [4], and installed it on top of Ubuntu. We performed real experiments with an NVIDIA Quadro K5200 GPU. We evaluated the performance of G-Storm in terms of throughput. We utilized Storm's default performance reporting system, the Storm UI, to collect many useful performance statistics, such as the number of tuples emitted, acked, and failed. The throughput is the number of tuples fully acknowledged within a one-minute interval. We compared G-Storm with the original Storm, which runs only on CPUs, by conducting extensive experiments using two typical online data processing applications (topologies), namely Continuous Query and Matrix Multiplication (stream version). In the following, we describe these applications and how they were run in detail, and then present and analyze the corresponding experimental results.
Fig. 4. Throughput on the continuous query topology: (a) Batching Size = 128; (b) Batching Size = 256; (c) Batching Size = 512

A. Continuous Query

A simple select query works by initializing access to a database table and looping over each row to check if there is a hit [2]. To implement this application in Storm, we designed a topology consisting of a spout and two bolts. The spout continuously emits a query, which is randomly generated. The second component is the query bolt, which can be either a CPU or a GPU bolt. The table is placed in memory on the host for Storm, or on the GPU for G-Storm. This bolt takes the query from the spout and iterates over the table to see if there is a matching entry. A hit on a matching record is then emitted to the last component in the topology, a printer bolt. This bolt simply takes the matching record and writes it to a file. In our experiment, we randomly generated a database table with vehicle plates and their owners' information (e.g., names and SSNs), and we randomly generated queries to search the table to find the owners of speeding vehicles (the vehicle speed is randomly generated and included in every tuple). We performed several experiments on both Storm and G-Storm using this application and present the corresponding results in Figure 4. The running time for each experiment is 10 minutes. In the first experiment, we set G-Storm's batching size to 128. From Figure 4(a), we can see that the first minute of execution shows relatively low throughput for both G-Storm and Storm. After the first minute, G-Storm is able to achieve a stable throughput. At the 10th minute, G-Storm achieves a throughput of 9,163 tuples/min, compared to Storm at 4,722 tuples/min, which represents a 1.94x improvement. Over the entire running time, G-Storm consistently outperforms Storm and achieves an average of 1.9x improvement on throughput. In the second experiment, we increased the batching size to 256.
At the 10th minute, G-Storm has a throughput of 17,750 tuples/min. From Figure 4(b), we can see that on average, G-Storm outperforms Storm by 3.7 times in terms of throughput. In the third experiment, as shown in Figure 4(c), we further increased the batching size to 512. We observed that on average this batching size increase leads to a 7x improvement on throughput over Storm. From these results, we observe that 1) G-Storm consistently outperforms Storm in terms of throughput, and 2) increasing the batching size leads to more significant improvement.

In G-Storm, multiple tuples are processed in a batch by a GPU, which leads to higher throughput but may add latency for individual tuples, since they have to wait for a short while before being offloaded to the GPU; moreover, data transfers cause additional latency. To show the trade-off between batching size and latency, we again used the Storm UI to collect the processing times of individual tuples, and present the average tuple processing time in Figure 5. As expected, a larger batching size leads to a longer average tuple processing time due to a longer waiting time in the buffer on the host. However, we notice that the cost of increasing the batching size on individual tuple processing time is rather negligible, compared to the significant improvement on throughput shown in Figure 4. Moreover, we can see that when G-Storm stabilizes, the average tuple processing time is generally below 0.9s, which is acceptable for online data processing.

Fig. 5. Average tuple processing time (Batching Size = 128, 256, 512)

B. Matrix Multiplication

The second application is a typical numerical computing application, Matrix Multiplication. For this application, we first prepared input files containing matrices. We generated two files, each containing around 9K randomly generated 8×8 double-precision matrices.
To feed the matrices into the topology, we used an open-source tool called LogStash [14] that can read data from a file and push them into a Redis queue. Two LogStash instances are used to read the two files into two Redis queue channels. In the topology, the first component is a ReadMatrix spout that continuously extracts data from the queue channels to form a tuple. Each tuple contains two matrices. The spout then emits the tuple to a bolt. Similarly, this bolt can be either a CPU bolt or a GPU bolt, which extracts the two input matrices and performs the multiplication operation
on a CPU or a GPU. The last stage of this topology is a printer bolt, which writes the results out to a file. We set the batching size to 16. As shown in Figure 6, we can see that G-Storm performs consistently better than Storm. At 180 seconds, G-Storm achieves a throughput of 4,341 tuples/min, compared to Storm at 1,777 tuples/min, which represents a 2.4x improvement. On average, G-Storm offers a 2.3x improvement over Storm on throughput. This application demonstrates the robustness of G-Storm in that it can work with different applications by adjusting the batching size. In such applications, the batching size can be a relatively small value, because G-Storm can map an element of each tuple to a CUDA thread, so that it can still fully utilize the resources on a GPU even with a small number of tuples.

Fig. 6. Throughput of the matrix multiplication application (Batching Size = 16)

IV. RELATED WORK

Mars [9] was the first known GPU-accelerated MapReduce system, which includes new user-defined and system APIs and a runtime system (based on NVIDIA GPUs). It needs to spend extra time counting the size of intermediate results due to the lack of dynamic memory allocation on a GPU at the time the paper was written. Another MapReduce framework, MapCG [10], was developed to improve the performance of Mars; it employs a light-weight memory allocator to enable efficient dynamic memory allocation, as well as a hash table for storing intermediate key-value pairs to avoid the shuffling overhead. StreamMR [6] is a similar system, which, however, was developed particularly for AMD GPUs. Efficient methods were introduced in [3], [12] to leverage shared memory on a GPU for further speedup. The authors of [19] introduced the development of a MapReduce-based system on a GPU cluster and presented several optimization mechanisms. Other related research efforts include MATE-CG [13] and Pamar [20].
G-Storm, in contrast, was built on Storm for general online stream data processing, with the addition of GPU acceleration; GPUs have a quite different programming model and architecture from such MapReduce-based systems.

V. CONCLUSIONS

In this paper, we presented the design, implementation and evaluation of G-Storm, a GPU-enabled parallel platform based on Storm that leverages the massively parallel computing power of GPUs for high-throughput online stream data processing. G-Storm was designed to be a general-purpose data processing platform as Storm, which can support various applications and data types. G-Storm seamlessly integrates GPUs into Storm while preserving its easy-to-use programming model. Moreover, G-Storm achieves high-throughput and low-overhead data processing with GPUs by buffering a reasonable number of tuples and sending them in a batch to a GPU. We implemented G-Storm based on Storm 0.9.2 with JCUDA, CUDA 6.5 and an NVIDIA Quadro K5200 GPU. We tested it with two different applications: continuous query and matrix multiplication. Extensive experimental results have shown that, compared to Storm, G-Storm achieves over 7x improvement on throughput for continuous query while maintaining a reasonable average tuple processing time. It also leads to a 2.3x throughput improvement on the matrix multiplication application.

REFERENCES

[1] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom and S. Whittle, MillWheel: fault-tolerant stream processing at internet scale, Proceedings of the VLDB Endowment, 2013.
[2] P. Bakkum and K. Skadron, Accelerating SQL database operations on a GPU with CUDA, 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), 2010.
[3] L. Chen and G. Agrawal, Optimizing MapReduce for GPUs with effective shared memory usage, ACM HPDC 2012.
[4] CUDA C Programming Guide.
[5] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Proceedings of USENIX OSDI 2004.
[6] M. Elteir, H. Lin, W. Feng and T.
Scogland, StreamMR: an optimized MapReduce framework for AMD GPUs, Proceedings of IEEE ICPADS 2011.
[7] GeForce Kepler GK110 basic specs leaked, geforce-kepler-gk110-basic-specs-leaked.
[8] Apache Hadoop.
[9] B. He, W. Fang, Q. Luo, N. K. Govindaraju and T. Wang, Mars: a MapReduce framework with graphics processors, Proceedings of ACM PACT 2008.
[10] C. Hong, D. Chen, W. Chen, W. Zheng and H. Lin, MapCG: writing parallel program portable between CPU and GPU, Proceedings of ACM PACT 2010.
[11] Java bindings for the CUDA runtime and driver API.
[12] F. Ji and X. Ma, Using shared memory to accelerate MapReduce on graphics processing units, IEEE IPDPS 2011.
[13] W. Jiang and G. Agrawal, MATE-CG: a MapReduce-like framework for accelerating data-intensive computations on heterogeneous clusters, Proceedings of IEEE IPDPS 2012.
[14] Logstash - Open Source Log Management.
[15] Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu and Z. Zhang, TimeStream: reliable stream computation in the cloud, Proceedings of EuroSys 2013.
[16] S4.
[17] Storm.
[18] Storm 0.9.2 on the Apache Software Foundation.
[19] J. A. Stuart and J. D. Owens, Multi-GPU MapReduce on GPU clusters, Proceedings of IEEE IPDPS 2011.
[20] Y. S. Tan, B-S. Lee, B. He and R. H. Campbell, A Map-Reduce based framework for heterogeneous processing element cluster environments, Proceedings of IEEE/ACM CCGrid 2012.
[21] J. Xu, Z. Chen, J. Tang and S. Su, T-Storm: Traffic-Aware Online Scheduling in Storm, Proceedings of IEEE ICDCS.
More information