Accelerating Spark RDD Operations with Local and Remote GPU Devices

Yasuhiro Ohno, Shin Morishima, and Hiroki Matsutani
Dept. of ICS, Keio University, Hiyoshi, Kohoku, Yokohama, Japan

Abstract—Apache Spark is a distributed processing framework for large-scale data sets, where intermediate data sets are represented as RDDs (Resilient Distributed Datasets) and stored in memory distributed over machines. To accelerate its various computation-intensive operations, such as reduction and sort, we focus on GPU devices. We modified the Spark framework to invoke CUDA kernels when computation-intensive operations are called. RDDs are transformed into array structures and transferred to GPU devices when necessary. Although we need to cache RDDs in GPU device memory as much as possible in order to hide the data transfer overhead, the number of local GPU devices mounted in a host machine is limited. In this paper, we propose to use remote GPU devices, which are connected to a host machine via a PCI-Express over 10Gbps Ethernet technology. To mitigate the data transfer overhead for remote GPU devices, we propose three RDD caching policies for local and remote GPU devices. We implemented various reduction programs (e.g., Sum, Max, LineCount) and transformation programs (e.g., SortByKey, PatternMatch, WordConversion) using local and remote GPU devices for Spark. Evaluation results show that Spark with GPU outperforms the original software by up to 21.4x. We also evaluate the RDD caching policies for local and remote GPU devices and show that a caching policy that minimizes the data transfer amount for remote GPU devices achieves the best performance.

Keywords—Apache Spark, RDD, GPU, CUDA, PCIe over 10GbE

I. INTRODUCTION

Due to recent advances in mobile and IoT devices, social networking services, and cloud computing, the data sets to be stored and analyzed are continuously growing in size and diversity. A distributed data processing framework is thus a key component for analyzing such large data sets, and various distributed data processing frameworks are available. Apache Hadoop [1] is one of the representative open-source products for this purpose. As a storage layer, it uses HDFS (Hadoop Distributed File System). For each processing step, input data sets are read from the distributed file system and the results are written back to it. This approach is fault-tolerant, as data sets are partitioned and stored across multiple nodes or disks. However, for applications that iteratively perform operations on the same data sets (e.g., machine learning), input/output data sets are read from and written to the distributed file system at every iteration; this requires a large number of disk accesses, resulting in a disk access bottleneck.

Apache Spark [2] is another representative open-source distributed data processing framework, in which intermediate data sets are stored in a distributed shared memory. More specifically, the data sets are represented as RDDs (Resilient Distributed Datasets) [3] and distributed over machines. Various operations can be performed on a given RDD as input data, and the computation result is stored as another RDD. Thus, for applications that iteratively perform operations on the same data sets, the disk accesses for reading and writing intermediate data can be drastically reduced. The Spark framework provides various functions on RDDs, which can be classified into two categories: Action and Transformation.
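For readers unfamiliar with the two categories, the distinction in stock Spark looks like the following minimal, self-contained illustration (names and data are ours, not taken from the paper):

```scala
// An Action returns a value to the driver; a Transformation returns a new RDD.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions on older Spark versions

object ActionVsTransformation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000)

    val sum: Long = rdd.map(_.toLong).reduce(_ + _)     // Action: scalar result
    val sorted    = rdd.map(x => (x % 100, x)).sortByKey() // Transformation: new RDD
    val evens     = rdd.filter(_ % 2 == 0)                 // Transformation: new RDD

    println(s"sum=$sum, sorted=${sorted.count()}, evens=${evens.count()}")
    sc.stop()
  }
}
```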
The computation result of an Action function (e.g., Max, Sum) is a scalar or vector value, while that of a Transformation function (e.g., Sort) is another RDD. Since these functions are computation-intensive, in this paper we accelerate them by using GPU (Graphics Processing Unit) devices. As data transfer between a host and GPU devices is a performance bottleneck, we try to cache RDDs in GPU device memory as much as possible in order to hide the data transfer overhead. However, since the number of local GPU devices mounted in a host machine is limited, cached RDDs in GPU device memory would be frequently swapped in and out. To mitigate this overhead, we propose to use remote GPU devices, which are connected to a host machine via a PCI-Express (PCIe) over 10Gbps Ethernet (10GbE) technology. We can thereby increase the number of remote GPU devices and the amount of memory available for parallel processing. However, the data transfer overhead of remote GPU devices is higher than that of local GPU devices. To use both local and remote GPU devices efficiently, we need to take into account the size and access frequency of RDDs when caching them in local or remote GPU devices. In this paper, we propose three RDD caching policies for local and remote GPU devices.¹

The rest of the paper is organized as follows. Section II overviews background knowledge. Section III illustrates our Array Cache design for GPU processing of Spark Action and Transformation functions, and Section IV extends it to local and remote GPU devices. Section V describes the Array Cache implementation. Section VI evaluates various Spark operations using local and remote GPU devices. Section VII concludes this paper.

¹ An early stage of this work appeared at HEART 2016 as a short presentation, which is not included nor published in the conference post-proceedings.

II. BACKGROUND AND RELATED WORK

A. Spark and RDD

Apache Spark is a distributed processing framework where data sets are represented as in-memory RDDs [3]. The framework is implemented in the Scala language. Although HDFS is used as a data source, intermediate results are stored as in-memory RDDs; Spark is thus efficient especially for applications that iteratively perform certain operations on the same data, such as machine learning. In the Spark framework, a driver node distributes tasks and data sets to worker nodes, which perform the tasks on the given data. The driver node then collects and aggregates the computation results from the worker nodes. In this paper, we focus on the worker nodes and modify them so that they invoke CUDA kernels when they perform computation-intensive operations on RDDs.

RDD operations can be classified into two categories: Action and Transformation functions. Action functions perform an aggregation operation on a given RDD, and the computation result is a scalar or vector value. Transformation functions perform a conversion operation on a given RDD, and the result is stored as a new RDD. As Action functions, in this paper we accelerate reduce()², max(), min(), and count() by using GPU devices. As Transformation functions, we accelerate sortByKey(), map(), and filter(). Although the above-mentioned functions do not cover all the functions provided by the Spark framework, they are representative ones, and GPU processing for the remaining functions can be implemented in the same way. Since intermediate data sets are represented as RDDs, a series of Action and Transformation functions can be represented as a graph where a node is an RDD and an edge is an Action or Transformation function.

² A Spark built-in function is denoted as function() in this paper.

B. GPU Processing on Spark Framework

HeteroSpark [4] is based on the Spark framework and utilizes GPU devices in addition to CPUs. cuSpark [5] is another Spark-like framework that performs parallel processing using GPU devices on a single machine. However, the implementation details of how RDDs are processed and cached on GPU devices are not discussed. In this paper, we introduce the Array Cache, an array structure extracted from a given RDD that is cached in GPU device memory.

Other than Spark, conventional MapReduce frameworks that employ GPU devices have been studied [6][7][8]. Mars [6] is an example of such GPU-based MapReduce frameworks; to utilize the massive thread parallelism of GPU devices, it provides a small set of APIs similar to those of CPU-based MapReduce frameworks. Optimization techniques that utilize the shared memory have been proposed in [7]. An OpenCL MapReduce framework optimized for AMD GPUs has been proposed in [8]. Our work is specialized for the in-memory RDDs of Spark, and we propose to cache them in local and remote GPU device memories.

In [9], a document-oriented database is accelerated by utilizing GPU devices. To process a large number of BSON documents with GPU devices, [9] proposes the DDB Cache, an array structure in which documents of the same field are extracted and represented in a Compressed Row Storage (CRS) format. A similar array structure based on the CRS format is used for accelerating a graph database with GPU devices [10]. The Array Cache used in this paper for GPU acceleration of Spark also uses a similar array structure based on the CRS format.
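To make the CRS-style layout concrete, the following is a minimal sketch of what such an Array Cache could look like for variable-length records such as text lines; this is our illustration under stated assumptions, not the paper's actual class:

```scala
// All records are packed into one flat byte array; an offsets array marks
// where each record begins (CRS-style layout for variable-length records).
final case class ArrayCache(
  rddId: Int,          // ID of the source RDD
  data: Array[Byte],   // all records concatenated back-to-back
  offsets: Array[Int]  // offsets(i) = start of record i; last entry = data.length
) {
  def numRecords: Int = offsets.length - 1
  def record(i: Int): Array[Byte] =
    java.util.Arrays.copyOfRange(data, offsets(i), offsets(i + 1))
}

object ArrayCache {
  // Build an Array Cache from records collected out of an RDD.
  def fromRecords(rddId: Int, records: Seq[Array[Byte]]): ArrayCache = {
    val offsets = records.scanLeft(0)(_ + _.length).toArray
    val data = new Array[Byte](offsets.last)
    records.zipWithIndex.foreach { case (r, i) =>
      System.arraycopy(r, 0, data, offsets(i), r.length)
    }
    ArrayCache(rddId, data, offsets)
  }
}
```

Both arrays are contiguous, so the structure can be handed to a GPU with two bulk copies, which is what makes this kind of layout attractive for device transfer.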
In addition, we propose to use remote GPU devices connected to a host machine via PCIe over 10GbE, in order to increase the number of GPU devices when needed. We employ NEC ExpEther [11][12] as the PCIe over 10GbE technology (see Figure 3). With ExpEther, PCIe packets for hardware resources (e.g., GPU devices) are transported over 10GbE by encapsulating them into Ethernet frames. Note that there are software services based on a client-server model that provide GPU computation for clients [13], whereas we employ a PCIe over 10GbE technology to connect many GPU devices directly. Remote GPU devices attached via such a PCIe over 10GbE technology have been used for database processing [14], but they have not been studied for Spark or MapReduce. Another contribution of this paper is to explore RDD caching policies for local and remote GPU devices based on a static analysis of a given Spark application.

III. SPARK FRAMEWORK WITH GPU

A. System Overview

[Fig. 1. Making Array Cache for GPU processing]

Figure 1 illustrates the flow of performing Action and Transformation functions of Spark using a GPU device. First, a given data set is transformed into an RDD by using the makeRDD() function of Spark. Then, the RDD is converted into an array structure suitable for GPU processing by using the collect() function of Spark. We call this array structure the Array Cache. To perform Action and Transformation functions using a GPU device, the Array Cache is transferred to the GPU device memory and a CUDA kernel is invoked. To reduce the conversion overhead, once Array Caches are created, they reside in host memory so that they can be reused for subsequent processing. In addition, once Array Caches are transferred to a GPU device memory, they reside there until they are replaced with newly-transferred Array Caches. The CUDA kernel corresponding to a given Action or Transformation function is performed on the Array Cache in the GPU device memory, and the computation result is returned to the host. The Array Cache implementation is described in Section V.
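The host-side portion of this flow can be sketched with jcuda's driver API roughly as follows; the PTX file name, the kernel name "sumKernel", the launch geometry, and the assumption that the kernel accumulates partial sums into a single output (e.g., with atomics) are placeholders of ours, not details from the paper:

```scala
// Copy an integer array to the device, launch a reduction kernel, and read
// back a scalar result; the input stays resident on the device for reuse.
import jcuda.{Pointer, Sizeof}
import jcuda.driver._
import jcuda.driver.JCudaDriver._

object GpuSumFlow {
  def sum(hostData: Array[Int]): Long = {
    setExceptionsEnabled(true)
    cuInit(0)
    val device = new CUdevice(); cuDeviceGet(device, 0)
    val context = new CUcontext(); cuCtxCreate(context, 0, device)

    // Load the compiled kernel and look up its entry point.
    val module = new CUmodule(); cuModuleLoad(module, "reduction.ptx")
    val kernel = new CUfunction(); cuModuleGetFunction(kernel, module, "sumKernel")

    // Transfer the Array Cache contents to GPU device memory.
    val n = hostData.length
    val dIn = new CUdeviceptr(); cuMemAlloc(dIn, n.toLong * Sizeof.INT)
    cuMemcpyHtoD(dIn, Pointer.to(hostData), n.toLong * Sizeof.INT)
    val dOut = new CUdeviceptr(); cuMemAlloc(dOut, Sizeof.LONG)
    cuMemsetD32(dOut, 0, 2) // zero the 8-byte accumulator

    // Launch the kernel; grid/block sizes are illustrative only.
    val params = Pointer.to(Pointer.to(dIn), Pointer.to(Array(n)), Pointer.to(dOut))
    cuLaunchKernel(kernel, 1024, 1, 1, 256, 1, 1, 0, null, params, null)
    cuCtxSynchronize()

    // Copy only the scalar result back to the host.
    val result = new Array[Long](1)
    cuMemcpyDtoH(Pointer.to(result), dOut, Sizeof.LONG)
    cuMemFree(dOut)
    result(0) // dIn is intentionally kept allocated for later reuse
  }
}
```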

B. Array Cache of RDD

[Fig. 2. Invoking GPU kernel from Spark framework]

Figure 2 illustrates how RDDs are cached as Array Caches and how a GPU kernel is invoked when an Action or Transformation function is called. As shown, the used Array Caches reside in GPU device memory and can be reused for subsequent processing. Here we introduce a table that manages the relationship between RDDs and Array Caches, which we call the RDD-Array Table and which we implemented in the Spark framework. When an Array Cache is created from an RDD, their IDs are stored in the RDD-Array Table. When a Transformation function is performed, the result (i.e., a new RDD) is recorded as a Result Array in the table. We implemented read/write functions for the RDD-Array Table in the Spark framework.

When an Action or Transformation function that utilizes a GPU device is called, the RDD-Array Table is checked to see whether the target RDD has already been transformed into an Array Cache. If the Array Cache does not exist, the target RDD is transformed into an Array Cache and registered in the RDD-Array Table. If it exists, the table is further checked to see whether the Array Cache resides in a GPU device memory; if not, it is transferred to a GPU device. At the GPU device side, the CUDA kernel corresponding to the given Action or Transformation function is executed. We employ jcuda [15] to invoke CUDA kernels from the Spark framework, which is implemented in the Scala and Java languages.

C. GPU Processing for Action

As Action functions, we illustrate GPU processing for the Sum, LineCount, and Max (Min) programs as examples. To calculate the sum of the elements in a given RDD, the reduce() function of the Spark framework is used. In the original Spark framework, the sum of the elements is calculated as follows.
1) A given data set is transformed into an RDD.
2) The reduce() function is performed on the RDD to calculate the sum of its elements.
We implemented the corresponding CUDA kernel based on [16]. In our Spark framework that supports GPU processing, the procedure is changed as follows.
1) A given data set is transformed into an RDD.
2) An Array Cache is created from the RDD, and their IDs are registered in the RDD-Array Table.
3) The Array Cache is transferred to a GPU device memory, and the CUDA kernel corresponding to the reduce() function is performed on the Array Cache.
To count the number of lines in a given RDD (i.e., LineCount), the count() function of the Spark framework is used. To find the maximum (minimum) value in a given RDD, max() (min()) is used. We implemented their corresponding CUDA kernels; their procedures are similar to that of Sum described above.
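Putting the RDD-Array Table logic together, a simplified sketch of the lookup-convert-transfer-launch sequence might look like the following; the helper functions stand for jcuda host code such as the earlier sketch, and ArrayCache is the structure sketched in Section II:

```scala
// RDD-Array Table bookkeeping for a GPU-enabled Action (our illustration).
import scala.collection.mutable
import jcuda.driver.CUdeviceptr

final case class CacheEntry(cache: ArrayCache, var device: Option[CUdeviceptr])

class RddArrayTable(transferToGpu: ArrayCache => CUdeviceptr,
                    launchSumKernel: CUdeviceptr => Long) {
  private val table = mutable.Map.empty[Int, CacheEntry] // RDD ID -> entry

  def gpuSum(rddId: Int, collectAsCache: () => ArrayCache): Long = {
    // 1) Reuse the Array Cache if this RDD was already converted.
    val entry = table.getOrElseUpdate(rddId, CacheEntry(collectAsCache(), None))
    // 2) Transfer to the GPU only if the cache is not already resident there.
    val dev = entry.device.getOrElse {
      val d = transferToGpu(entry.cache)
      entry.device = Some(d)
      d
    }
    // 3) Run the CUDA kernel on the (possibly reused) device-resident data.
    launchSumKernel(dev)
  }
}
```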
D. GPU Processing for Transformation

Transformation functions can be accelerated by GPU devices as well as Action functions. A major difference between Action and Transformation functions is that the computation result of a Transformation is a new RDD. Thus, the computation result of the GPU devices is transformed into an RDD by using the makeRDD() function of Spark. We illustrate GPU processing for the SortByKey, PatternMatch, and WordConversion programs as examples.

To sort key-value pairs by their keys in a given RDD, the sortByKey() function of Spark is used. In the original Spark framework, key-value pairs are sorted by their keys as follows.
1) A given data set is transformed into an RDD.
2) The sortByKey() function is performed on the RDD to sort its key-value pairs by their keys.
3) The computation result is stored as a new RDD.
We implemented a CUDA kernel of the Bitonic sort algorithm. The procedure is changed as follows.
1) A given data set is transformed into an RDD.
2) An Array Cache is created from the RDD, and their IDs are registered in the RDD-Array Table.
3) The Array Cache is transferred to a GPU device memory, and the CUDA kernel corresponding to the sortByKey() function is performed on the Array Cache.
4) The computation result of the GPU is transformed into an RDD by using makeRDD().
To extract the lines that match a given pattern from a given text (i.e., PatternMatch), the filter() function of the Spark framework is used. To convert an input text into a specific output based on a given rule (i.e., WordConversion), the map() function of the Spark framework is used. We implemented their corresponding CUDA kernels.

E. GPU Processing for Complex Processing

We can combine Action and Transformation functions to perform more complex processing on a given RDD, and such compound processing can also be accelerated by GPU devices. We illustrate GPU processing for calculating the variance of the elements in a given RDD as an example. To calculate the variance, both the reduce() and map() functions are used. In the original Spark framework, the variance is calculated as follows.
1) A given data set is transformed into an RDD.
2) The reduce() function is performed on the RDD to calculate the average value of its elements.
3) The map() function is performed on the same RDD to calculate the square of the difference between each element value and the average value.
4) The computation result is stored as a new RDD.
5) The reduce() function is performed on the new RDD to calculate the average value of its elements.
With GPU processing, the procedure is changed as follows.
1) A given data set is transformed into an RDD.
2) An Array Cache is created from the RDD, and their IDs are registered in the RDD-Array Table.
3) The Array Cache is transferred to a GPU device memory, and the CUDA kernel corresponding to the reduce() function is performed on the Array Cache.
4) The CUDA kernel corresponding to the map() function is performed on the Array Cache.
5) The CUDA kernel corresponding to the reduce() function is performed on the Array Cache.
The input RDD is transferred to the GPU device only once (Step 3). The three CUDA kernels are then performed on the data already resident in the GPU device memory (Steps 3, 4, and 5).
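Under the same assumptions, this compound flow can be expressed as a host-side composition in which only the first step touches host memory; the kernel wrappers below are hypothetical stand-ins for jcuda launches like the earlier sketch:

```scala
// Variance as three kernel invocations over device-resident data.
import jcuda.driver.CUdeviceptr

object VarianceFlow {
  def gpuVariance(dev: CUdeviceptr, n: Long,
                  gpuReduceSum: CUdeviceptr => Double,
                  gpuMapSquaredDiff: (CUdeviceptr, Double) => CUdeviceptr): Double = {
    val mean   = gpuReduceSum(dev) / n          // Step 3: reduce() kernel
    val sqDiff = gpuMapSquaredDiff(dev, mean)   // Step 4: map() kernel; result stays on the GPU
    gpuReduceSum(sqDiff) / n                    // Step 5: reduce() kernel on the new device array
  }
}
```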

[Fig. 3. Remote GPU devices connected via 10GbE]

IV. LOCAL AND REMOTE GPU DEVICES

A. System Overview

In this section, we extend our Spark framework to use multiple GPU devices connected to a host via PCIe (i.e., local GPU devices) and via PCIe over 10GbE (i.e., remote GPU devices, as shown in Figure 3). Figure 4 illustrates the framework with local and remote GPU devices. Since the number of GPU devices mounted in a host machine is limited by the available PCIe slots, the number and size of RDDs that can be cached in GPU device memories are also limited. If the GPU device memory capacity is insufficient, the Array Caches of RDDs may be frequently swapped out of and into the GPU devices. Here we propose to combine a few local GPU devices with a number of remote GPU devices connected via PCIe over 10GbE, so that the GPU device memory capacity available for Array Caches can be expanded.

[Fig. 4. Spark with local and remote GPU devices]

There are interesting and important design options regarding where RDDs should be cached: on local or remote GPU devices. While the number of remote GPU devices can be expanded, remote GPU devices may suffer communication overhead since they are connected to the host via 10GbE. Local GPU devices do not suffer this 10GbE communication overhead, but their number is limited. Thus, efficient use of the local and remote GPU devices is a key to accelerating the Spark framework.

B. Caching Policies for Local and Remote GPUs

Caching policies for RDDs that take into account the characteristics of local and remote GPU devices affect the performance. For example, we may improve performance by maximizing the number of CUDA kernel executions on local GPU devices. Alternatively, we may improve performance by minimizing the amount of data transferred between the host and remote GPU devices. In this paper, we explore the following three RDD caching policies.
1) Policy-1 caches Array Caches on local/remote GPU devices so that the most-used RDD is cached in a local GPU device.
2) Policy-2 caches Array Caches on local/remote GPU devices so that the number of data transfers between the host and remote GPU devices is minimized.
3) Policy-3 caches Array Caches on local/remote GPU devices so that the total amount of data transferred between the host and remote GPU devices is minimized.
Policy-1 tries to reduce the CUDA kernel execution overheads for remote GPU devices.
Policy-2 and Policy-3 try to reduce the data transfers between the host and remote GPU devices; Policy-2 focuses on the number of transfers, while Policy-3 focuses on their total amount. The data transfers can be classified into 1) input data transfers from the host to remote GPU devices before CUDA kernel execution and 2) output data transfers from remote GPU devices to the host after CUDA kernel execution. The output data transfer size depends on the type of function executed: Action functions return only a scalar or vector value, while Transformation functions return a new RDD. If CUDA kernels are performed on the same input data, the input data transfers can be eliminated except for the first execution. Otherwise, input data transfers are needed whenever the Array Cache of an RDD is swapped out of and back into a GPU device memory.
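As one concrete, hypothetical reading of Policy-3, placement could be scored by the transfer amount that a local slot saves, i.e., access count times RDD size; the cost model and names below are illustrative and are not the paper's implementation:

```scala
// Greedy placement: RDDs that would otherwise move the most bytes to remote
// GPUs are pinned to the limited local GPU slots.
final case class RddStats(id: Int, sizeGB: Double, accessCount: Int)

object Policy3 {
  def place(rdds: Seq[RddStats], localSlots: Int): Map[Int, String] = {
    // Transfer amount avoided if this RDD never leaves a local GPU.
    val ranked = rdds.sortBy(r => -(r.accessCount * r.sizeGB))
    ranked.zipWithIndex.map { case (r, i) =>
      r.id -> (if (i < localSlots) "local" else "remote")
    }.toMap
  }
}
```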

The three RDD caching policies are applied based on a static analysis of the source code of a given application running on the Spark framework. More specifically, an RDD graph, where a node is an RDD and an edge is a relationship between two RDDs (in the case of a Transformation) or between an RDD and a value (in the case of an Action), is extracted from the Spark application source code by our code analysis program. Based on the RDD graph, we can estimate the number of accesses to each RDD in a given application simply by counting the number of edges on each node. Each node (i.e., RDD) in our RDD graph has a size property. Our static analysis program estimates the result size of each edge (i.e., Transformation or Action function) as 1) scalar-size, 2) smaller-size, 3) equal-size, or 4) larger-size. The actual result size highly depends on the given function and input data. For example, Action functions return a value, which is very small compared to the input RDD. Among Transformation functions, sortByKey() returns an RDD of the same size as its input, while filter() returns a smaller RDD. In the case of filter(), application programmers can feed an expected match rate as a hint for estimating the result size of each edge with better accuracy, but in this paper we simply estimate the result size as one of the above four patterns.

[Fig. 5. Creation of RDD graph]

Figure 5 illustrates an example of an RDD graph extracted by our static analysis program. As shown, the number of accesses to RDD1 is three and that to RDD2 is one. When an input data set is loaded, the input data size is marked as 1. For Action functions, a result size of 0 indicates that the result is a scalar value. For Transformation functions, result sizes of 0.5, 1, and 1.5 indicate that the result is smaller than, equal to, and larger than the input RDD, respectively. For example, since the LineCount program is based on an Action function, its result size is marked as 0. The WordConversion program uses a Transformation function that performs a word conversion on a given text; since the result can be expected to be almost the same size as the input text, its result size is marked as 1. The PatternMatch program uses a Transformation function that extracts words from a given text based on a given rule; its result is assumed to be small, but the exact size is not known, so its result size is marked as 0.5. Since the input data and the results of both the WordConversion and PatternMatch programs performed on RDD1 must be kept for future use, the cumulative size of RDD1 is 2.5.

To simplify the code analysis, the i-th RDD used in the evaluations is denoted as RDD(i), and all the GPU programs and their result sizes are known beforehand. Lines containing RDD(i).(program name) in the source code are counted by our code analysis program, which yields the number of accesses and the total result size for each RDD. Based on the number of accesses and the input/output sizes of each RDD in the RDD graph, our proposed RDD caching policies determine where each Array Cache is stored (i.e., on local or remote GPU devices). For example, Policy-1 places RDD1 on a local GPU device, as RDD1 is the most frequently used RDD. We will evaluate the three RDD caching policies in terms of application execution times in Section VI.

V. ARRAY CACHE IMPLEMENTATION

To process an RDD using GPUs, the RDD is converted into an Array Cache by using the collect() function of Spark. The Array Cache, together with its array size, the original RDD ID, and the original data type, is cached in the host main memory as a Java object. We prepare an array for these Java objects in the host main memory; when such a Java object is created, it is recorded in this array.
We call the set of these records the RDD-Array Table. Since each record holds an RDD ID and an Array Cache, the target Array Cache can be found easily. The registration and search times of the RDD-Array Table are negligible compared to the computation and transfer times.

When a Spark function that utilizes GPUs is called, the Array Cache corresponding to the target RDD is selected from the RDD-Array Table and its size is compared to that of the GPU device memory. If the Array Cache is smaller, it is transferred to the GPU device memory using jcuda. It is then partitioned into blocks, each of which is stored in the shared memory of a thread block, and the blocks are processed by the CUDA kernel function corresponding to the called Spark function. Intermediate and/or final results are stored in a separate array so that the original array can be reused for future computations. Finally, the result array is returned to the host and all the intermediate data are deleted from the device memory. The result array can be reused if needed.

If the target Array Cache is larger than the GPU device memory, it is divided into smaller sub-arrays, each of which fits into the GPU device memory. If, for example, the target Array Cache is 10GB and the GPU device memory is 3GB, the Array Cache is divided into three 3GB sub-arrays and a single 1GB sub-array. These sub-arrays are processed by a single GPU device sequentially or, if multiple local and/or remote GPU devices are available, multiple sub-arrays are transferred to the available GPU devices and processed in parallel. Thus, RDDs larger than a GPU's device memory can be processed efficiently by utilizing multiple local and remote GPU devices. The host machine then gathers the partial results from these GPU devices and merges them into the final result, which is stored as a new RDD.
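Under the ArrayCache layout sketched earlier, the sub-array division could look like the following; record boundaries are respected so that no element is split across devices (our sketch, not the authors' code):

```scala
// Divide an Array Cache that exceeds GPU device memory into sub-arrays,
// cutting at record offsets rather than at raw byte positions.
object ArrayCacheSplitter {
  def splitForDevice(cache: ArrayCache, deviceBytes: Long): Seq[ArrayCache] = {
    val subArrays = scala.collection.mutable.Buffer[ArrayCache]()
    var first = 0
    while (first < cache.numRecords) {
      var last = first
      // Grow the sub-array while the next record still fits in device memory.
      while (last < cache.numRecords &&
             cache.offsets(last + 1) - cache.offsets(first) <= deviceBytes) last += 1
      if (last == first) last += 1 // a single oversized record is kept alone
      val records = (first until last).map(cache.record)
      subArrays += ArrayCache.fromRecords(cache.rddId, records)
      first = last
    }
    subArrays.toSeq
  }
}
```

Each resulting sub-array is itself a valid Array Cache, so it can be dispatched to a local or remote GPU independently and the partial results merged on the host afterwards.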

VI. EVALUATIONS

A. Performance of Action

Here, we compare the Spark framework with CPU, local GPU, and remote GPU in terms of the execution times of Action functions. Table I shows the evaluation environment. The Java heap size was set to 200GB. In the remote GPU case, each GTX 980 Ti GPU is connected to the host via two SFP+ direct attached cables with NEC ExpEther 10G.

TABLE I. EVALUATION ENVIRONMENT
CPU: Intel Xeon E v3
Number of CPU cores: 4
Operating frequency: 3.50GHz
Memory capacity: 256GB
GPU: GeForce GTX 980 Ti
OS: CentOS 6.7
CUDA version: 7.5
jcuda version:
Spark version (standalone mode):
Scala version: 2.10

[Fig. 6. Execution times of programs with CPU, local GPU, and remote GPU: (a) Sum (overhead: 15.6s), (b) Max (overhead: 15.4s), (c) Min (overhead: 15.6s), (d) LineCount (overhead: 28.3s), (e) SortByKey (overhead: 206.2s), (f) PatternMatch 10% (overhead: 28.4s), (g) PatternMatch 1% (overhead: 28.2s), (h) WordConversion (overhead: 28.6s), (i) Variance (overhead: 9.9s)]

We implemented Sum, Max, Min, and LineCount programs for the performance evaluation of Action functions. Five consecutive executions of a program (i.e., the 1st through 5th executions) are defined as a continuous execution set, and we performed 50 continuous execution sets for each program. Average execution times over the 50 sets are reported. For the 1st execution, the execution time includes a Spark start-up time and an RDD creation time. An Array Cache creation time is additionally needed for the GPU executions. Currently, Array Cache creation is not optimized and takes a relatively long time (we discuss this issue in Section VII). In Figure 6, the Spark start-up time, RDD creation time, and Array Cache creation time are not included in the bars because they are required only when the target RDD is newly created. Of these three overheads, only the Array Cache creation time is shown, as the overhead in the caption of each graph.

Figure 6(a) shows the execution times of the Sum program on CPU, local GPU, and remote GPU. A 1GB 32-bit signed integer array with random values is used as input. As start-up overheads, Spark start-up, RDD creation, and Array Cache creation take 1.9 seconds, 8.4 seconds, and 15.6 seconds, respectively. Note that a GPU execution time is the sum of the GPU processing time and the data transfer time between the host and the GPU device. The 1st execution therefore takes longer, since the Array Cache is transferred to the GPU device memory, while the execution times are much reduced for subsequent executions because the transferred Array Cache can be reused. As shown in the graph, local GPU outperforms CPU by 11.4x for the 1st execution. It also outperforms CPU by x for the subsequent executions. Remote GPU is comparable to local GPU for the subsequent executions, while its performance gain is reduced for the 1st execution due to the data transfer overhead.

Figures 6(b) and 6(c) show the execution times of the Max

and Min programs on CPU, local GPU, and remote GPU, respectively. The 1GB integer array with random values is used as input. Figure 6(d) shows the execution times of the LineCount program; a 1GB randomly-generated text file is used as input. The results are similar to those of the Sum program (Figure 6(a)). Local GPU outperforms CPU by up to 21.4x in the LineCount program.

B. Performance of Transformation

We implemented SortByKey, PatternMatch, and WordConversion programs for the performance evaluation of Transformation functions. Figure 6(e) shows the execution times of the SortByKey program on CPU, local GPU, and remote GPU. A 1GB array of key-value pairs (512MB of keys and 512MB of values) is used as input. Note that, for the GPU executions, only the key part (i.e., 512MB) is transferred to the GPU device memory for the SortByKey program. The Array Cache creation time is quite long, because the extraction of keys from a given RDD is performed in a naive manner in Scala, which should be further optimized. For the GPU executions, as explained, the Array Cache is transferred to GPU device memory for the 1st execution, while it can be reused and the data transfer overheads can be eliminated for the subsequent executions. In Figure 6(e), however, there is almost no difference between the 1st and subsequent executions in the local and remote GPU cases. This is because the GPU processing time of the SortByKey program is large, so the data transfer time becomes relatively small. As a result, local GPU outperforms CPU by 16.0x for the 1st execution (if we do not consider the Array Cache creation time) and by x for the subsequent executions. Remote GPU outperforms CPU by 14.2x for the 1st execution and by x for the subsequent executions.

Figures 6(f) and 6(g) show the execution times of the PatternMatch program with different hit rates (10% and 1%). A 1GB text file in which 10% or 1% of the words match a given pattern is used as input. Figure 6(h) shows the execution times of the WordConversion program; the 1GB randomly-generated text file is used as input. Because the data transfer overhead is relatively large compared to the computation time, the performance gain of remote GPU over CPU is limited.

C. Performance of Complex Processing

We implemented the Variance program for the performance evaluation of compound processing. Figure 6(i) shows the execution times of the Variance program. The 1GB integer array with random values is used as input. The Array Cache creation time is 9.9 seconds. Local GPU outperforms CPU by 11.5x for the 1st execution and by x for the subsequent executions. Remote GPU outperforms CPU by 5.8x for the 1st execution and by x for the subsequent executions.

D. RDD Cache Policies

In the previous subsections, we showed the measured execution times of various programs on CPU, local GPU, and remote GPU. Using these measured execution times as parameters, we simulate the RDD caching policies proposed in Section IV-B. Here we assume three applications that use the Action and Transformation programs we implemented. Since the first and second applications are simple, we assume that one local and one remote GPU device are available. The third application is larger, so we assume that one local and seven remote GPU devices are available. We also assume that only a single Array Cache, converted from a given RDD, can be cached at a time in each GPU device.
If RDDj depends on RDDi, their two computations are performed sequentially; if the two RDDs have no dependency, they are performed in parallel.

In the first application, 1GB integer data, 1GB text data, and 1GB text data are loaded as RDD1, RDD2, and RDD3, respectively. Then the following steps are performed.
1) RDD1.Sum
2) RDD2.LineCount
3) RDD1.Variance
4) RDD4 = RDD3.WordConversion
5) RDD1.Max
6) RDD5 = RDD2.WordConversion
7) RDD1.Min
In this application, the same RDD (i.e., RDD1) is used repeatedly; the most used RDD is RDD1.

In the second application, 1GB of key-value pair data, in which each key is a random integer and each value is a random string, is loaded as RDD1. The following steps are performed.
1) RDD2 = RDD1.SortByKey
2) RDD3 = RDD2.values.WordConversion
3) RDD2.keys.Max
4) RDD2.keys.Min
5) RDD3.LineCount
6) RDD3.Min
In this application, RDDs are created sequentially along a linear RDD graph; the most used RDD is RDD2.

To evaluate the RDD caching policies with a larger application, 50 randomly-selected functions are performed on randomly-selected RDDs as the third application. The number of source RDDs is set to three, the randomly-generated RDDs have 1-13 edges as a result, and the size of each RDD is set to 1GB.

Tables II, III, and IV show the simulation results for the three applications. In the first application, we assume that there are no dependencies between RDDs. RDDs without dependencies can be executed in parallel by using remote GPU devices, which improves performance. When Policy-1 is adopted, all operations on RDD1 can be performed on the local GPU device without swapping. However, since RDD2 and RDD3 are processed on remote GPU devices, the total amount of data transferred between the host and remote GPU devices becomes large. As the execution time on remote GPU devices is 2x longer than on the local GPU device, the total performance gets worse. When Policy-2 is adopted, the number of data transfers between the host and remote GPU devices is only two. As a result, the data transfer overhead is reduced compared to Policy-1 and the performance improves. When Policy-3 is

adopted, RDD2 and RDD3 are processed on the local GPU device and the total data transfer amount is minimized. As a result, Policy-3 achieves the best performance.

In the second application, we assume that there are dependencies between RDDs. The dependencies introduce waiting time during parallel processing, which reduces the benefit of adding more remote GPU devices. As a result, "Local GPU only", which uses only the local GPU device, achieves good performance. Nevertheless, Policy-3 achieves the best performance, because RDD2 and RDD3 can be processed in parallel by the local and remote GPUs after RDD3 has been created.

In a large application such as the third one, Array Caches and parallel processing on remote GPU devices lead to a performance improvement; in this situation, Policy-3 again achieves the best performance. Based on the above results, for applications with parallelism, adding remote GPU devices reduces the number of RDD swap-outs from GPU device memory and thus the data transfer overhead.

TABLE II. SIMULATION RESULTS OF THE FIRST APPLICATION
RDD caching policy | RDD1 | RDD2 | RDD3 | Execution time
Policy-1 | Local | Remote | Remote | (sec)
Policy-2 | Local | Local | Remote | (sec)
Policy-3 | Remote | Local | Local | (sec)
Local GPU only | Local | Local | Local | (sec)

TABLE III. SIMULATION RESULTS OF THE SECOND APPLICATION
RDD caching policy | RDD1 | RDD2 | RDD3 | Execution time
Policy-1 | Remote | Local | Remote | (sec)
Policy-2 | Remote | Local | Local | (sec)
Policy-3 | Local | Local | Remote | (sec)
Local GPU only | Local | Local | Local | (sec)

TABLE IV. SIMULATION RESULTS OF THE THIRD APPLICATION
RDD caching policy | Execution time
Policy-1 | (sec)
Policy-2 | (sec)
Policy-3 | (sec)
Local GPU only | (sec)

E. Performance of Large RDD

[Fig. 7. Execution times of Sum program on different RDD sizes]

As mentioned in Section V, an RDD larger than a GPU device memory is divided into sub-arrays and processed by one or more GPU devices in parallel. Assuming a GPU device memory size of 1GB for simplicity, we evaluate the execution times of the Sum program on an RDD (a 32-bit randomly-generated integer array) with sizes from 1GB to 8GB. We use three GPU devices (one local and two remote GPU devices, denoted as 3GPU) for this evaluation; thus, up to three 1GB sub-arrays are processed by one local and two remote GPU devices in parallel. Such a parallel execution is repeated until all the sub-arrays have been processed. Figure 7 shows the 1st execution times of CPU, 3GPU+split, and split, where split denotes the overhead of dividing a given RDD into 1GB sub-arrays. We performed 10 continuous execution sets, and the average of the 1st execution times is shown in the graph. As shown, the execution time of 3GPU+split increases with the data size. The split time is negligible compared to the computation time. Due to the performance asymmetry of the three GPU devices, the execution times fluctuate slightly, especially when a remainder sub-array is processed by a remote GPU device.

VII. SUMMARY AND FUTURE WORK

To accelerate various computation-intensive operations of Spark, we modified the Spark framework to invoke CUDA kernels when such operations are called, and we cache RDDs in GPU device memory as much as possible. In addition, we proposed the use of remote GPU devices, connected to a host machine via PCIe over 10GbE, to expand the device memory capacity and to use GPU devices in parallel. We proposed three RDD caching policies for local and remote GPU devices.
We implemented Sum, Max, Min, LineCount, SortByKey, PatternMatch, WordConversion, and Variance programs on the Spark framework with local and remote GPU devices. We evaluated the program execution times of local and remote GPU devices, assuming the same operation is performed iteratively. The evaluation results showed that our modified Spark utilizing a local GPU device outperforms the original Spark by up to 21.4x. The performance degradation of a remote GPU device compared to the local GPU device is not significant, except for the first execution of an iteration or for executions with large results. Because of the data transfer overhead of remote GPU devices, the evaluation of the three RDD caching policies showed that the policy that minimizes the data transfer amount to remote GPU devices outperforms the other caching policies.

Although we used a single host machine with local and remote GPU devices, our approach can be extended to multiple host machines that maintain more local and remote GPU devices. Although the aim of this paper is to explore how to utilize local and remote GPU devices from the Spark framework, the Array Cache creation time should be reduced as future work. We currently use the built-in collect() function to build an Array Cache from a given RDD for ease of implementation; to reduce this overhead, we plan to implement a dedicated RDD-to-array conversion function in our framework. Although we implemented relatively simple applications for demonstration purposes in this paper, we plan to implement more sophisticated applications and demonstrate the benefits of our approach as future work.

Acknowledgements: This work was supported by SECOM Science and Technology Foundation and JST PRESTO.

REFERENCES

[1] Apache Hadoop, https://hadoop.apache.org/
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in Proceedings of the USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10), Jun. 2010.
[3] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in Proceedings of the USENIX Conference on Networked Systems Design and Implementation (NSDI'12), Apr. 2012.
[4] P. Li, Y. Luo, N. Zhang, and Y. Cao, "HeteroSpark: A Heterogeneous CPU/GPU Spark Platform for Machine Learning Algorithms," in Proceedings of the International Conference on Networking, Architecture and Storage (NAS'15), Aug. 2015.
[5] cuSpark - a Functional Data Processing Framework, http://yaomuyang.com/cuspark/
[6] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce Framework on Graphics Processors," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'08), Oct. 2008.
[7] F. Ji and X. Ma, "Using Shared Memory to Accelerate MapReduce on Graphics Processing Units," in Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS'11), May 2011.
[8] M. Elteir, H. Lin, W.-c. Feng, and T. Scogland, "StreamMR: An Optimized MapReduce Framework for AMD GPUs," in Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS'11), Dec. 2011.
[9] S. Morishima and H. Matsutani, "Performance Evaluations of Document-Oriented Databases using GPU and Cache Structure," in Proceedings of the International Symposium on Parallel and Distributed Processing with Applications (ISPA'15), Aug. 2015.
[10] S. Morishima and H. Matsutani, "Performance Evaluations of Graph Database using CUDA and OpenMP-Compatible Libraries," ACM SIGARCH Computer Architecture News (CAN), vol. 42, no. 4, Sep. 2014.
[11] J. Suzuki, Y. Hidaka, J. Higuchi, T. Yoshikawa, and A. Iwata, "ExpressEther - Ethernet-Based Virtualization Technology for Reconfigurable Hardware Platform," in Proceedings of the IEEE Annual Symposium on High-Performance Interconnects (HOTI'06), Aug. 2006.
[12] J. Suzuki, Y. Hayashi, M. Kan, S. Miyakawa, and T. Yoshikawa, "End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet," in Proceedings of the IEEE Annual Symposium on High-Performance Interconnects (HOTI'14), Aug. 2014.
[13] J. Duato, A. Pena, F. Silla, R. Mayo, and E. Quintana-Orti, "rCUDA: Reducing the Number of GPU-based Accelerators in High Performance Clusters," in Proceedings of the International Conference on High Performance Computing and Simulation (HPCS'10), Jun. 2010.
[14] S. Morishima and H. Matsutani, "Distributed In-GPU Data Cache for Document-Oriented Data Store via PCIe over 10Gbit Ethernet," in Proceedings of the International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar'16), Aug. 2016.
[15] jcuda.org, http://www.jcuda.org/
[16] "Optimizing Parallel Reduction in CUDA," cuda/samples/6_Advanced/reduction/doc/reduction.pdf


More information

Distributed Implementation of BG Benchmark Validation Phase Dimitrios Stripelis, Sachin Raja

Distributed Implementation of BG Benchmark Validation Phase Dimitrios Stripelis, Sachin Raja Distributed Implementation of BG Benchmark Validation Phase Dimitrios Stripelis, Sachin Raja {stripeli,raja}@usc.edu 1. BG BENCHMARK OVERVIEW BG is a state full benchmark used to evaluate the performance

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf A Brief History: 2004 MapReduce paper 2010 Spark paper 2002 2004 2006 2008 2010 2012 2014 2002 MapReduce @ Google

More information

Accelerating Data Warehousing Applications Using General Purpose GPUs

Accelerating Data Warehousing Applications Using General Purpose GPUs Accelerating Data Warehousing Applications Using General Purpose s Sponsors: Na%onal Science Founda%on, LogicBlox Inc., IBM, and NVIDIA The General Purpose is a many core co-processor 10s to 100s of cores

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Introduction to Spark

Introduction to Spark Introduction to Spark Outlines A brief history of Spark Programming with RDDs Transformations Actions A brief history Limitations of MapReduce MapReduce use cases showed two major limitations: Difficulty

More information

Adaptive Control of Apache Spark s Data Caching Mechanism Based on Workload Characteristics

Adaptive Control of Apache Spark s Data Caching Mechanism Based on Workload Characteristics Adaptive Control of Apache Spark s Data Caching Mechanism Based on Workload Characteristics Hideo Inagaki, Tomoyuki Fujii, Ryota Kawashima and Hiroshi Matsuo Nagoya Institute of Technology,in Nagoya, Aichi,

More information

RIDE: real-time massive image processing platform on distributed environment

RIDE: real-time massive image processing platform on distributed environment Kim et al. EURASIP Journal on Image and Video Processing (2018) 2018:39 https://doi.org/10.1186/s13640-018-0279-5 EURASIP Journal on Image and Video Processing RESEARCH RIDE: real-time massive image processing

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing

Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing Richard Johansson December 1, 2015 today's lecture as you've seen, processing large corpora can take time! for

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Introduction: Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Clustering is an important machine learning task that tackles the problem of classifying data into distinct groups based

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

On Performance Evaluation of BM-Based String Matching Algorithms in Distributed Computing Environment

On Performance Evaluation of BM-Based String Matching Algorithms in Distributed Computing Environment International Journal of Future Computer and Communication, Vol. 6, No. 1, March 2017 On Performance Evaluation of BM-Based String Matching Algorithms in Distributed Computing Environment Kunaphas Kongkitimanon

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

GPU-Accelerated Apriori Algorithm

GPU-Accelerated Apriori Algorithm GPU-Accelerated Apriori Algorithm Hao JIANG a, Chen-Wei XU b, Zhi-Yong LIU c, and Li-Yan YU d School of Computer Science and Engineering, Southeast University, Nanjing, China a hjiang@seu.edu.cn, b wei1517@126.com,

More information

Integration of Machine Learning Library in Apache Apex

Integration of Machine Learning Library in Apache Apex Integration of Machine Learning Library in Apache Apex Anurag Wagh, Krushika Tapedia, Harsh Pathak Vishwakarma Institute of Information Technology, Pune, India Abstract- Machine Learning is a type of artificial

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters

Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters 1 Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters Yuan Yuan, Meisam Fathi Salmi, Yin Huai, Kaibo Wang, Rubao Lee and Xiaodong Zhang The Ohio State University Paypal Inc. Databricks

More information

An FPGA NIC Based Hardware Caching for Blockchain

An FPGA NIC Based Hardware Caching for Blockchain An FPGA NIC Based Hardware Caching for Blockchain Yuma Sakakibara, Kohei Nakamura, and Hiroki Matsutani Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan {yuma,nakamura,matutani}@arc.ics.keio.ac.jp

More information

732A54/TDDE31 Big Data Analytics

732A54/TDDE31 Big Data Analytics 732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark PL.Marichamy 1, M.Phil Research Scholar, Department of Computer Application, Alagappa University, Karaikudi,

More information

SMCCSE: PaaS Platform for processing large amounts of social media

SMCCSE: PaaS Platform for processing large amounts of social media KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

Modification and Evaluation of Linux I/O Schedulers

Modification and Evaluation of Linux I/O Schedulers Modification and Evaluation of Linux I/O Schedulers 1 Asad Naweed, Joe Di Natale, and Sarah J Andrabi University of North Carolina at Chapel Hill Abstract In this paper we present three different Linux

More information

New Developments in Spark

New Developments in Spark New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,

More information

Performance Evaluation of Big Data Frameworks for Large-Scale Data Analytics

Performance Evaluation of Big Data Frameworks for Large-Scale Data Analytics Performance Evaluation of Big Data Frameworks for Large-Scale Data Analytics Jorge Veiga, Roberto R. Expósito, Xoán C. Pardo, Guillermo L. Taboada, Juan Touriño Computer Architecture Group, Universidade

More information