WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3

Prepared by: Eyal Edelman, Big Data Practice Lead; Michael Birch, Big Data and Analytics Consultant
Date: May 31, 2017

Table of Contents

1. Apache Spark RDD, DataFrame and Dataset
2. Executive Summary
3. RDD, DataFrame and Dataset API Comparison
4. RDD, DataFrame and Dataset Benchmark Plan
   4.1. Benchmark Goals
   4.2. Benchmark Dataset
   4.3. Benchmark Data Load and Caching
   4.4. Benchmark Clusters Specifications
5. Benchmark Results
   5.1. Basic Data Manipulation API Comparison in Spark 2.1
   5.2. Basic Data Manipulation API Spark 1.6 vs. Spark 2.1
   5.3. Join Comparison
   5.4. RDD GroupBy vs. ReduceByKey
   5.5. Increasing Cluster Size
6. About SWI
Appendix A - Benchmark Detailed Results
   A.1. Result for Filter benchmark
   A.2. Result for Aggregate By Key benchmark
   A.3. Result for Sort benchmark
   A.4. Result for Join benchmark
Appendix B - Benchmark Code
   B.1. RDD / Dataset object
   B.2. Filter Code
   B.3. Aggregate By Key Code
   B.4. Sort Code
   B.5. Join Code
   B.6. Broadcast Code

1. Apache Spark RDD, DataFrame and Dataset

Apache Spark is one of the most popular computation engines in the Big Data space. Its distributed in-memory computation is renowned for its speed, which the official Apache Spark page claims to be 100 times faster than Map-Reduce. In addition to its speed, Apache Spark also includes SQL support and a rich machine learning library, which makes it a favourite choice of Data Scientists for analytical processing.

Apache Spark computations are performed on distributed object collections, which are the foundation of the Apache Spark distributed computation engine. The Resilient Distributed Dataset (RDD) was the first type of distributed object collection, offered by Apache Spark from its very first version. Since then, Apache Spark has been extended to also include DataFrames (since version 1.3) and Datasets (since version 1.6).

This paper compares the API and performance of these three distributed collections (RDD, DataFrame and Dataset). The benchmarks were run on the latest available version of Apache Spark (version 2.1), as well as on the latest available version of the Spark 1.x line (version 1.6.3) for those who have not yet upgraded to Spark 2.x.

2. Executive Summary

DataFrame has shown significant performance improvements over RDD in both Spark version 1.6 and Spark version 2.1.

In Spark 2.1, comparing the Dataset Untyped API (i.e. DataFrame v2) and the Typed API revealed a significant performance difference between seemingly equivalent APIs. Untyped aggregate and sort transformations were 6 times faster than their Typed equivalents, while the Typed join was 4.5 times faster than its Untyped equivalent. Also, a join in Spark 2.1 DataFrame v2 (Untyped Dataset) took twice as long to execute as in Spark 1.6 DataFrame v1.

When implementing a Spark 2.1 application, it is recommended to try implementing both the Typed and Untyped APIs and select the one with the best performance for your operation. This should not be hard to do, as both are supported on the same Dataset collection. When migrating DataFrame code from Spark 1.6 to Spark 2.1, there may be a performance decline and some code adjustment may be required.

It is common practice to add more worker nodes and resources to your cluster in order to boost performance. It is important to note that there are exponentially diminishing returns to throwing more hardware at the problem: beyond a certain point, the cost greatly exceeds the benefit. The flexibility of Spark in the cloud is a cost-effective way to try several scenarios and find the optimal cluster size for your needs.

3. RDD, DataFrame and Dataset API Comparison

Resilient Distributed Datasets (RDDs) have been available since Spark version 0.5 (June 2012), and are still the underlying basic API for Spark. Each element in an RDD is a strongly typed object, which is compiled to a Java object and resides in the JVM heap. Certain operations, such as joining two RDDs, require a specific type of RDD called PairRDD. PairRDD elements are tuples of Key and Value (both Key and Value can be any serializable object).

DataFrames were introduced in Spark version 1.3. The DataFrame is a Spark abstraction for a table, which allows executing analytic SQL queries (SQL 2003) on any data source (text file, XML, JSON etc.). DataFrame elements are untyped: each element is stored in the generic object Row and has a multi-level schema (similar to Parquet files). In order to optimize memory utilization and speed up operations, DataFrames are stored in a privately managed memory space outside the JVM heap.

Dataset is an attempt to combine the benefits of both RDD and DataFrame. It is strongly typed like an RDD, but also supports SQL and is stored off-heap like a DataFrame. Dataset supports two sets of transformations: Typed (RDD style) and Untyped (DataFrame style). Datasets were first introduced in Spark version 1.6 but were considered too experimental and not recommended for use prior to Spark version 2.x. Since Datasets are stored off-heap, storing an object in a Dataset requires an encoder. Spark provides implicit methods to generate encoders for all basic types and all Scala case classes.

In Spark version 2.x, DataFrames and Datasets are combined into a single entity, where DataFrame is simply a type alias for Dataset[Row]. Because of that, there is a major API change for DataFrames between Spark 1.x and Spark 2.x. In this document we will refer to Spark 1.x DataFrames as DataFrame v1 and Spark 2.x DataFrames as DataFrame v2.
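As a minimal sketch of these relationships (Spark 2.x, Scala): the spark.implicits._ import supplies the encoders mentioned above, and converting between the typed and untyped views is a type-level operation. The trimmed NGram class, the app name and the local master are illustrative, not part of the benchmark code.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class NGram(gram: Seq[String], year: Int, occurances: Long)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("encoder-sketch").master("local[*]").getOrCreate()
    import spark.implicits._ // implicit encoders for basic types and Scala case classes

    // Typed collection: each element is a strongly typed NGram, stored off-heap via its encoder
    val ds: Dataset[NGram] = Seq(NGram(Seq("a", "b"), 2000, 1L)).toDS()

    // In Spark 2.x, DataFrame is just a type alias for Dataset[Row]
    val df: DataFrame = ds.toDF()           // drop the static type
    val back: Dataset[NGram] = df.as[NGram] // restore it, using the encoder again

    spark.stop()
  }
}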

4. RDD, DataFrame and Dataset Benchmark Plan

4.1. Benchmark Goals

The benchmark goals were to test the following scenarios:

1. Compare the performance of basic data manipulation transformations across all distributed collections in Spark version 2.1. We selected the filter, sort, and aggregate by key operations, as they represent the most common operations used to manipulate data that involve a non-trivial workload (e.g. they usually require shuffling data between worker nodes).

2. In order to provide insights for those who are considering an upgrade, or are locked into older Spark versions, we compared the performance of the basic transformations in RDD and DataFrame v1 in Spark version 1.6 versus the performance of RDD and DataFrame v2 in Spark version 2.1.

3. Data manipulation sometimes requires combining data from multiple sources. To test this, we compared the performance of joining our large dataset to a small and a medium dataset. Spark documentation recommends using Broadcast to optimize small joins, so we also compared Broadcast vs. the small join.

4. Spark documentation also recommends avoiding RDD GroupBy and using ReduceByKey instead, so we compared the performance of both operations.

5. We know that cluster size has an impact on performance, so we compared the same operation on multiple cluster sizes (see Table 2 below).

4.2. Benchmark Dataset

In order to provide a meaningful benchmark, we established the following guidelines for our intended dataset:

- Data volume should be representative of large data processing. We decided on a minimal threshold of at least a billion rows.
- Data should be structured, so it could be easily imported into RDD, DataFrame and Dataset.
- Data should contain numeric fields for GroupBy calculations.
- Data should contain many unique keys, so Join and GroupBy operations would be meaningful.
- Data should be open and available for others who wish to run our benchmarks themselves.

We selected the AWS-hosted Google Books N-Grams dataset. We used the British English 4-gram set, which has 5,325,077,699 rows and is about 200GB in uncompressed file size (46.6 GB as compressed sequence files).

An N-gram is a set of N tokens which appeared consecutively in some written corpus. In the case of the Google Books dataset, that written corpus consisted of many thousands of books which were digitized by Google. For example, the 3-grams for the sentence "The yellow dog played fetch." are: [The, yellow, dog]; [yellow, dog, played]; [dog, played, fetch]; and [played, fetch, .] (punctuation also counts as a token and words are case-sensitive); the sliding-window sketch below reproduces this decomposition.
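The decomposition above is easy to reproduce in a few lines of Scala (purely illustrative; the benchmark dataset ships with the N-grams already computed):

object NGramSketch {
  def main(args: Array[String]): Unit = {
    val tokens = Seq("The", "yellow", "dog", "played", "fetch", ".")
    // sliding(3) yields every run of 3 consecutive tokens: exactly the 3-grams listed above
    tokens.sliding(3).foreach(println)
  }
}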

The benchmark's data schema is detailed in Table 1.

Table 1: Benchmark Schema

Field       | Type          | Description
N-Gram      | Array[String] | Array of 4 words
Year        | Int           | Year of the document
Occurrences | Int           | Number of times the N-Gram appeared in Year
Pages       | Int           | Number of unique pages containing the N-Gram in Year
Books       | Int           | Number of unique books containing the N-Gram in Year

4.3. Benchmark Data Load and Caching

The AWS N-Gram dataset is stored as a sequence file. The AWS sequence file was preprocessed into a Parquet file and read into the collections in the following order (a code sketch appears at the end of this section):

1. Read the Parquet file into a DataFrame.
2. From the DataFrame above, generate a Dataset[N-Gram].
3. From the Dataset[N-Gram] above, generate an RDD[N-Gram].

We initially read the sequence file directly into an RDD and used it to generate the other collections. We found its results to be comparable to the method above, so we decided to use the Parquet method instead, as it offered simpler collection creation.

In order to prevent misleading results, every transformation benchmark was executed on its own, running one transformation at a time, with no chaining of multiple transformations. As only one transformation was run in each run, no caching of collections was performed (i.e. no cache() or persist() was called on any collection).
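A minimal sketch of that load order (Spark 2.x, Scala; the Parquet path is a placeholder and the NGram case class follows Appendix B.1):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object LoadOrderSketch {
  case class NGram(row: Long, gram: Seq[String], year: Int,
                   occurances: Long, pages: Long, books: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("load-order").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("s3://your-bucket/ngrams.parquet") // 1. Parquet -> DataFrame
    val ds: Dataset[NGram] = df.as[NGram]                          // 2. DataFrame -> Dataset[NGram]
    val rdd: RDD[NGram] = ds.rdd                                   // 3. Dataset[NGram] -> RDD[NGram]

    spark.stop()
  }
}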

4.4. Benchmark Clusters Specifications

The benchmark was executed on both AWS and Azure HDInsight clusters of varied sizes, in order to provide more comprehensive benchmark results. The benchmark clusters configuration is detailed in Table 2.

Table 2: Benchmark Clusters Configuration

Name  | Environment     | Master Nodes                      | Worker Nodes                            | Storage Used
AWS4  | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 4 x r4.xlarge (4 Cores, 30GB RAM each)  | S3
AWS8  | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 8 x r4.xlarge (4 Cores, 30GB RAM each)  | S3
AWS16 | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 16 x r4.xlarge (4 Cores, 30GB RAM each) | S3
AWS32 | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 32 x r4.xlarge (4 Cores, 30GB RAM each) | S3
HDI4  | Azure HDInsight | 2 x (4 Cores, 7GB RAM each)       | 4 x (4 Cores, 28GB RAM each)            | Local Cluster HDFS
HDI13 | Azure HDInsight | 2 x (4 Cores, 7GB RAM each)       | 13 x (4 Cores, 28GB RAM each)           | Local Cluster HDFS

5. Benchmark Results

Following is an analysis of the benchmark results. The full results, as well as the benchmark code, are detailed in the appendixes.

5.1. Basic Data Manipulation API Comparison in Spark 2.1

We compared the performance of RDD, DataFrame and Dataset in Spark version 2.1. We checked the individual performance of the Filter, Aggregate by Key and Sort transformations, as they are staple data manipulation tasks. They are also non-trivial tasks, in that they require major operations on the collections and moving data between different data nodes. We used total CPU time as the benchmark number, as it is a better representation of the actual cluster activity. We normalized all numbers by the RDD total CPU time for ease of comparison between the collections. Chart 1 below shows the benchmark results based on the HDI13 run.

[Chart 1: Basic API Comparison, Spark version 2.1 - total CPU as % of RDD for Filter, Aggregate by Key and Sort; series: RDD, DataFrame v2, Dataset v2]

As expected, DataFrame v2 outperformed the older RDD, taking less than half the time to perform the sort or aggregate transformation and less than three quarters of the time to filter.

Dataset performance provided a very surprising result. Considering that in Spark 2.1 a DataFrame is a Dataset[Row], the benchmark is actually between the Dataset Typed and Untyped transformations, which on the surface seem equivalent. However, the actual results show that the Dataset Typed transformations were significantly slower: about six times slower than the Untyped transformations used in DataFrame, and 2.5 times slower than RDD, when performing the aggregate or sort transformation.
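The two seemingly equivalent aggregation styles look like this (condensed from Appendix B.3; assumes the NGram case class, with its + method, from Appendix B.1). One plausible explanation for the gap, which this benchmark did not profile, is that Typed transformations run opaque JVM lambdas that force each element to be deserialized, while Untyped Column expressions remain visible to the Catalyst optimizer:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{max, sum}

object AggregationStyles {
  // Untyped (DataFrame-style): column expressions Catalyst can optimize
  def untypedAgg(ds: Dataset[NGram]): DataFrame =
    ds.groupBy("gram").agg(max("year").as("year"), sum("occurances").as("occurances"))

  // Typed (RDD-style): an opaque lambda over deserialized NGram objects
  def typedAgg(spark: SparkSession, ds: Dataset[NGram]): Dataset[NGram] = {
    import spark.implicits._
    ds.groupByKey(_.gram).mapGroups { case (_, rows) => rows.reduce(_ + _) }
  }
}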

5.2. Basic Data Manipulation API Spark 1.6 vs. Spark 2.1

We compared the performance of the same basic API in Spark 1.6. We only tested RDD and DataFrame, as Datasets are not recommended for use prior to Spark version 2.x. Due to a data capture limitation in Spark 1.6 we could only collect elapsed time for our operations, so elapsed time was used for the benchmark.

We compared the operations of RDD and DataFrame in Spark version 1.6. We normalized all numbers by the RDD elapsed time for ease of comparison. Chart 2 below shows the benchmark results based on the HDI13 run.

[Chart 2: Basic API Comparison, Spark version 1.6 - elapsed time as % of RDD for Filter, Aggregate by Key and Sort; series: RDD, DataFrame v1]

We also compared the operations of RDD and DataFrame in Spark version 1.6 vs. the same API in Spark version 2.1. We normalized all numbers, showing Spark 2.1 elapsed time as a percentage of the Spark 1.6 time for ease of comparison. Chart 3 below shows the benchmark results based on the HDI13 and HDI4 runs.

[Chart 3: Basic API Comparison, Spark version 2.1 elapsed time as % of Spark version 1.6, for Filter, Aggregate by Key, Sort and Sort (HDI4); series: RDD, DataFrame v2 vs. v1]

We see a significant improvement in filter time between Spark 2.1 and Spark 1.6. DataFrame and RDD show only slight improvements for the aggregate between the Spark versions. RDD sort is slightly better, while the result for DataFrame sort is mixed: DataFrame sort on HDI13 was 15% slower in Spark version 2.1, while on HDI4 it was slightly faster than Spark version 1.6 on HDI4. This may be attributed to the fact that we had to use elapsed time for our comparison, and the difference in sort elapsed times for HDI13 is less than 22 seconds.

5.3. Join Comparison

When we tried to perform any join operation on our large RDD, the Spark operation terminated with an error. Even joins to a ~300-record table could not complete. This is a clear indication that RDDs are not suitable for join operations.

We proceeded to test a join to a small table (~300 rows) in DataFrame and Dataset. We also simulated a join using a Broadcast variable and compared it to the small join time. We used total CPU time as the benchmark number. Chart 4 below shows the benchmark results based on the HDI13 run.

[Chart 4: Join Comparison, Spark version 2.1 - total CPU (minutes) for Small Join vs. Broadcast; series: DataFrame v2, Dataset v2]

For joins, it was DataFrame performance that provided a very surprising result. As described above, the benchmark is actually between the Dataset Typed and Untyped transformations, which on the surface seem equivalent. However, the actual results show that the Dataset Typed transformation was significantly faster: about 4.5 times faster than the Untyped transformation used in DataFrame.

Although DataFrame seems to benefit from replacing a small join with Broadcast, the Dataset small join operation in Spark version 2.1 appears to be optimized and equivalent in performance to a Broadcast operation.

DataFrame small join performance in Spark version 2.1 seemed oddly slow, so we compared it with DataFrame small join in Spark version 1.6. To our surprise, DataFrame in Spark 1.6 completed the small join operation in half the time of DataFrame in Spark version 2.1, but was still almost three times slower than the equivalent Typed join transformation of Dataset.

DataFrame v1 (Spark 1.x) and DataFrame v2 (Spark 2.x) are very different under the hood. This is an important point to consider when upgrading your application: you may need to rewrite your code to regain performance that was lost through the upgrade.
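As a side note on the Broadcast comparison above: besides simulating the join with a Broadcast variable as in Appendix B.6, Spark 2.x also exposes an explicit broadcast hint on the DataFrame API. A minimal sketch (bigDF and smallDF are placeholders; this variant was not part of the benchmark runs):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  // The hint asks the planner to ship smallDF to every executor and
  // run a broadcast hash join instead of shuffling the big side.
  def smallJoin(bigDF: DataFrame, smallDF: DataFrame): DataFrame =
    bigDF.join(broadcast(smallDF), "year")
}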

5.4. RDD GroupBy vs. ReduceByKey

We tested the Spark documented recommendation of using RDD ReduceByKey instead of GroupBy followed by Reduce. The ReduceByKey completed in less than 15 minutes of elapsed time. The GroupBy did not complete after 7.5 hours, with executors restarting multiple times due to internal errors. We confirmed that RDD GroupBy should be avoided at all costs.

We recognize that there are situations, especially in Data Science calculations, in which a GroupBy operation cannot be replaced by ReduceByKey. An example is a window operation such as calculating the average annual growth rate of an N-Gram's usage, which is calculated for each N-Gram, with its years in ascending order, by the following formula:

    (1/n) * Σ_i (occurrences_{i+1} - occurrences_i) / (year_{i+1} - year_i)

We were able to calculate the average annual growth on our large dataset on the HDI13 cluster using both the Dataset Typed GroupBy (with iterators over the groups) and DataFrame SQL window functions. The Dataset GroupBy completed in 48 minutes and the DataFrame SQL window functions (lag) completed in 26 minutes; a sketch of the window-function approach follows below.

For cases in which ReduceByKey cannot replace GroupBy, your best alternative is to use DataFrame SQL window functions, and if that is not possible, use the Dataset Typed GroupBy.
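A sketch of the window-function variant (Spark 2.x, Scala; column names follow Table 1 rather than the benchmark's exact code, and the per-step rate implements the formula above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, lag}

object GrowthRateSketch {
  // For each N-Gram, lag() pairs every year with the previous one,
  // giving the per-step growth, which is then averaged per N-Gram.
  def avgAnnualGrowth(df: DataFrame): DataFrame = {
    val byGram = Window.partitionBy("gram").orderBy("year")
    df.withColumn("prevOccurrences", lag(col("occurrences"), 1).over(byGram))
      .withColumn("prevYear", lag(col("year"), 1).over(byGram))
      .withColumn("rate",
        (col("occurrences") - col("prevOccurrences")) / (col("year") - col("prevYear")))
      .groupBy("gram")
      .agg(avg("rate").as("avgAnnualGrowth")) // first year of each N-Gram yields null and is ignored by avg
  }
}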

5.5. Increasing Cluster Size

Increasing cluster resources by adding more worker nodes should result in improved performance. The additional nodes add to your total cost of operations, especially in a pay-per-use environment such as AWS or Azure HDInsight. We ran the Filter API on a set of AWS clusters, doubling the worker nodes after each run. This allowed us to assess the cost-benefit ratio of the increased cluster size.

The relevant measure of cluster effectiveness is the elapsed time, as it represents the total time the cluster spent producing the result. In a pay-per-use environment your cost is also directly related to the total elapsed time the cluster was used.

First we looked at the percent marginal improvement in elapsed time as we doubled the worker nodes from 4 up to 32. Chart 5 below shows the result.

[Chart 5: Percent elapsed-time improvement when doubling the worker nodes (4 to 8, 8 to 16, 16 to 32); series: RDD, DataFrame v2, DataSet v2]

We could clearly see that the rate of improvement was declining. We then compared the actual elapsed time improvements in minutes. To assess the cost-effectiveness of the improvement, we divided the extra cluster cost by the total marginal minutes improved, to get the cost per one minute of marginal improvement; a worked example of this calculation follows below. Charts 6 and 7 below show the results.
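As a purely hypothetical illustration of this metric (neither the prices nor the timings below are the benchmark's actual figures): suppose doubling from 8 to 16 workers adds 8 extra nodes at $0.27 per node-hour, the larger cluster finishes the run in 30 minutes, and the run completes 6 minutes sooner than on 8 workers. The extra cluster cost is 8 x $0.27/hour x 0.5 hours = $1.08, so the marginal cost is $1.08 / 6 = $0.18 per minute of improvement.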

[Chart 6: Total elapsed-time improvement in minutes when doubling the worker nodes (4 to 8, 8 to 16, 16 to 32); series: RDD, DataFrame v2, DataSet v2]

[Chart 7: Cost per minute of elapsed-time improvement when doubling the worker nodes (4 to 8, 8 to 16, 16 to 32); series: RDD, DataFrame v2, DataSet v2]

The cost-benefit comparison of cost per marginal minute of improvement shows an exponentially declining benefit to increasing the cluster size and resources. The optimal cost-benefit point in our situation seems to lie between 8 and 16 worker nodes. There appears to be very little benefit in growing the cluster from 16 worker nodes to 32 worker nodes.

6. About SWI

SWI offers deep business expertise and a flawless track record of delivery for high-impact technology projects in the Energy and Financial sectors. SWI has over 35 years of experience offering dependable technology consulting services. Since SWI was founded in 1978, we have provided solutions for nuclear, financial, energy and transportation organizations: industries where dependability and business continuity are critical. Over the last three decades, SWI has developed a distinguished reputation based on a flawless track record of solutions, long-term partnerships with our clients and deep domain expertise in the energy and financial sectors.

SWI's history provides a solid foundation for growth, building on the company's notable achievements, including:

- Long-term relationships with clients that have grown through trust and a shared commitment to excellence
- A history of delivering high-quality solutions on schedule and on budget
- Participation in professional associations and standards organizations as part of our commitment to continuous improvement
- A reputation for best-in-class solutions

We feel strongly that by building long-term partnerships with our clients, based on trust and an ongoing commitment to excellence, we can accomplish great things.

WHAT WE DO
SWI provides a full range of consulting services, including custom software development, systems integration, project management, strategic planning, business analysis, and quality assurance solutions. We strive to provide our clients with the highest quality solutions to gain the best results for their business.

WHO WE ARE
Located in midtown Toronto, SWI is an IT professional services firm with a 35-year track record of delivering high-value solutions to enterprise customers in the Energy and Financial sectors. SWI has a reputation for maintaining deep industry knowledge and extensive experience in architecting and implementing complex IT systems.

Appendix A - Benchmark Detailed Results

General notes on all results:
- All times are in minutes.
- Spark version 1.6 does not provide total CPU time or Garbage Collection (GC) time.
- Dataset was only tested in Spark version 2.1.

A.1. Result for Filter benchmark

[Table: Filter benchmark results per cluster (AWS4, AWS8, AWS16, AWS32, HDI4, HDI13), for Spark v2.1 on all clusters and v1.6 on the HDI clusters, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. Under Spark v1.6 on HDI13, elapsed time was 9.64 for RDD and 7.39 for DataFrame.]

A.2. Result for Aggregate By Key benchmark

Note: AWS4 had execution errors during the calculation and is omitted from the result list.

[Table: Aggregate By Key benchmark results per cluster (AWS8, AWS16, AWS32, HDI4, HDI13) and Spark version, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. Under Spark v1.6, RDD elapsed time on HDI13 was 6.72 and DataFrame elapsed time on HDI4 was 5.99.]

A.3. Result for Sort benchmark

Note: AWS4 had execution errors during the calculation and is omitted from the result list.

[Table: Sort benchmark results per cluster (AWS8, AWS16, AWS32, HDI4, HDI13) and Spark version, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. Under Spark v1.6, RDD and DataFrame elapsed times on HDI13 were 6.86 and 2.60 respectively, and DataFrame elapsed time on HDI4 was 5.97.]

A.4. Result for Join benchmark

All Join tests were run on the HDI13 cluster. The Join benchmark tested three types of joins:
- Small: join of our big N-Gram collection with a 300-row reference table.
- Broadcast: using a Broadcast of the 300 rows instead of a Join.
- Medium: join of our big N-Gram collection with a medium (1GB) table.

Notes:
- Error indicates the Join operation failed to complete and exited with an error.
- Since the small join failed for RDD, we did not run Broadcast on it.
- Dataset was not run for Spark version 1.6.

[Table: Join benchmark results per join type (Small, Broadcast, Medium) and Spark version, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. RDD exited with errors on the small join in both v2.1 and v1.6, and the medium join exited with errors in v2.1 for RDD and DataFrame.]

Appendix B - Benchmark Code

B.1. RDD / Dataset object

import scala.math.max

case class NGram(row: Long, gram: Seq[String], year: Int,
                 occurances: Long, pages: Long, books: Long) {
  def +(other: NGram): NGram =
    NGram(max(row, other.row), gram, max(year, other.year),
          occurances + other.occurances, pages + other.pages, books + other.books)
}

B.2. Filter Code

RDD[N-Gram]:
  rdd.filter(_.gram.forall(containsLowercaseLettersOnly))

DataFrame:
  df.filter(concat_ws("", df.col("gram")).rlike("^[a-z]++$"))

Dataset[N-Gram]:
  ds.filter(_.gram.forall(containsLowercaseLettersOnly))

B.3. Aggregate By Key Code

RDD[N-Gram]:
  rdd.map(row => row.gram -> row).reduceByKey(_ + _)

DataFrame:
  df.groupBy("gram").agg(max("year").as("year"), sum("occurances").as("occurances"))

Dataset[N-Gram]:
  ds.groupByKey(_.gram).mapGroups { case (_, rows) => rows.reduce(_ + _) }

B.4. Sort Code

RDD[N-Gram]:
  rdd.sortBy(_.occurances, ascending = false)

DataFrame:
  df.sort(df.col("occurances").desc)

Dataset[N-Gram]:
  ds.sort(ds.col("occurances").desc)

B.5. Join Code

RDD[N-Gram]:
  val keyedData = rdd.map(row => row.year -> row)
  val keyedReference = refRDD.map(row => row.year -> row)
  keyedData.join(keyedReference).map { case (_, (v1, v2)) => (Some(v1), Some(v2)) }

DataFrame:
  bigDF.join(smallDF, "year")

Dataset[N-Gram]:
  bigDS.joinWith(smallDS, bigDS.col("year") === smallDS.col("year"), "inner")

B.6. Broadcast Code

RDD[N-Gram]:
  rdd.map(nGram => (nGram, reference.value.get(nGram.year)))

DataFrame:
  df.withColumn("yearInWords", bc.value.get(df.col("year")))

Dataset[N-Gram]:
  ds.map(nGram => (nGram, bc.value.get(nGram.year)))
