WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3

Prepared by: Eyal Edelman, Big Data Practice Lead; Michael Birch, Big Data and Analytics Consultant
Date: May 31, 2017

Table of Contents

1. Apache Spark RDD, DataFrame and Dataset
2. Executive Summary
3. RDD, DataFrame and Dataset API Comparison
4. RDD, DataFrame and Dataset Benchmark Plan
   4.1. Benchmark Goals
   4.2. Benchmark Dataset
   4.3. Benchmark Data Load and Caching
   4.4. Benchmark Clusters Specifications
5. Benchmark Results
   5.1. Basic Data Manipulation API Comparison in Spark 2.1
   5.2. Basic Data Manipulation API Spark 1.6 vs. Spark 2.1
   5.3. Join Comparison
   5.4. RDD GroupBy vs. ReduceByKey
   5.5. Increasing Cluster Size
6. About SWI
Appendix A - Benchmark Detailed Results
   A.1. Result for Filter benchmark
   A.2. Result for Aggregate By Key benchmark
   A.3. Result for Sort benchmark
   A.4. Result for Join benchmark
Appendix B - Benchmark Code
   B.1. RDD / Dataset object
   B.2. Filter Code
   B.3. Aggregate By Key Code
   B.4. Sort Code
   B.5. Join Code
   B.6. Broadcast Code

1. Apache Spark RDD, DataFrame and Dataset

Apache Spark is one of the most popular computation engines in the Big Data space. Its distributed in-memory computation is renowned for its speed, which the official Apache Spark page claims to be 100 times faster than Map-Reduce. In addition to its speed, Apache Spark also includes SQL support and a rich machine learning library, which makes it a favourite choice of Data Scientists for analytical processing.

Apache Spark computations are performed on distributed object collections, which are the foundation of the Apache Spark distributed computation engine. The Resilient Distributed Dataset (RDD) was the first type of distributed object collection, offered by Apache Spark from its very first version. Since then, Apache Spark has been extended to also include DataFrames (since version 1.3) and Datasets (since version 1.6).

This paper compares the API and performance of these three distributed collections (RDD, DataFrame and Dataset). The benchmarks were run on the latest available version of Apache Spark (version 2.1), as well as on the latest available version of the Spark 1.x line (version 1.6.3) for those who have not yet upgraded to Spark 2.x.

2. Executive Summary

DataFrame has shown significant performance improvements over RDD in both Spark version 1.6 and Spark version 2.1.

In Spark 2.1, comparing the Dataset Untyped API (i.e. DataFrame v2) and the Typed API revealed a significant performance difference between seemingly equivalent APIs. Untyped aggregate and sort transformations were 6 times faster than their Typed equivalents, while the Typed join was 4.5 times faster than its Untyped equivalent. Also, a join in Spark 2.1 DataFrame v2 (Untyped Dataset) took twice as long to execute as in Spark 1.6 DataFrame v1.

When implementing a Spark 2.1 application, it is recommended to try implementing both the Typed and Untyped APIs and select the one with the best performance for your operation. This should not be hard to do, as both are supported on the same Dataset collection. When migrating DataFrame code from Spark 1.6 to Spark 2.1, there may be a performance decline and some code adjustment may be required.

It is common practice to add more worker nodes and resources to your cluster in order to boost performance. It is important to note that there are exponentially diminishing returns to throwing more hardware at the problem: beyond a certain point, the cost greatly exceeds the benefit. The flexibility of Spark in the cloud is a cost-effective way to try several scenarios and find the optimal cluster size for your needs.

3. RDD, DataFrame and Dataset API Comparison

Resilient Distributed Datasets (RDDs) have been available since Spark version 0.5 (June 2012), and are still the underlying basic API for Spark. Each element in an RDD is a strongly typed object, which is compiled to a Java object and resides in the JVM heap. Certain operations, such as joining two RDDs, require a specific type of RDD called PairRDD. PairRDD elements are tuples of Key and Value (both Key and Value can be any serializable object).

DataFrames were introduced in Spark version 1.3. The DataFrame is a Spark abstraction for a table, which allows executing analytic SQL queries (SQL 2003) on any data source (text file, XML, JSON etc.). DataFrame elements are untyped: each element is stored in the generic object Row and has a multi-level schema (similar to Parquet files). In order to optimize memory utilization and speed up operations, DataFrames are stored in a privately managed memory space outside the JVM heap.

Dataset is an attempt to combine the benefits of both RDD and DataFrame. It is strongly typed like an RDD, but also supports SQL and is stored off-heap like a DataFrame. Dataset supports two sets of transformations: Typed (RDD style) and Untyped (DataFrame style). Datasets were first introduced in Spark version 1.6 but were considered too experimental and not recommended for use prior to Spark version 2.x. Since Datasets are stored off-heap, storing an object in a Dataset requires an encoder. Spark provides implicit methods to generate encoders for all basic types and all Scala case classes.

In Spark version 2.x, DataFrames and Datasets are combined into a single entity, where DataFrame is simply a type alias for Dataset[Row]. Because of that, there is a major API change for DataFrames between Spark 1.x and Spark 2.x. In this document we will refer to Spark 1.x DataFrames as DataFrame v1 and Spark 2.x DataFrames as DataFrame v2.
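As a minimal sketch of these relationships (Spark 2.x, Scala): the spark.implicits._ import supplies the encoders mentioned above, and converting between the typed and untyped views is a type-level operation. The trimmed NGram class, the app name and the local master are illustrative, not part of the benchmark code.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class NGram(gram: Seq[String], year: Int, occurances: Long)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("encoder-sketch").master("local[*]").getOrCreate()
    import spark.implicits._ // implicit encoders for basic types and Scala case classes

    // Typed collection: each element is a strongly typed NGram, stored off-heap via its encoder
    val ds: Dataset[NGram] = Seq(NGram(Seq("a", "b"), 2000, 1L)).toDS()

    // In Spark 2.x, DataFrame is just a type alias for Dataset[Row]
    val df: DataFrame = ds.toDF()           // drop the static type
    val back: Dataset[NGram] = df.as[NGram] // restore it, using the encoder again

    spark.stop()
  }
}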

4. RDD, DataFrame and Dataset Benchmark Plan

4.1. Benchmark Goals

The benchmark goals were to test the following scenarios:

1. Compare the performance of basic data manipulation transformations across all distributed collections in Spark version 2.1. We selected the filter, sort, and aggregate by key operations, as they represent the most common operations used to manipulate data that involve a non-trivial workload (e.g. they usually require shuffling data between worker nodes).

2. In order to provide insights for those who are considering an upgrade, or are locked into older Spark versions, we compared the performance of the basic transformations in RDD and DataFrame v1 in Spark version 1.6 versus the performance of RDD and DataFrame v2 in Spark version 2.1.

3. Data manipulation sometimes requires combining data from multiple sources. To test this, we compared the performance of joining our large dataset to a small and a medium dataset. Spark documentation recommends using Broadcast to optimize small joins, so we also compared Broadcast vs. the small join.

4. Spark documentation also recommends avoiding RDD GroupBy and using ReduceByKey instead, so we compared the performance of both operations.

5. We know that cluster size has an impact on performance, so we compared the same operation on multiple cluster sizes (see Table 2 below).

4.2. Benchmark Dataset

In order to provide a meaningful benchmark, we established the following guidelines for our intended dataset:

- Data volume should be representative of large data processing. We decided on a minimal threshold of at least a billion rows.
- Data should be structured, so it could be easily imported into RDD, DataFrame and Dataset.
- Data should contain numeric fields for GroupBy calculations.
- Data should contain many unique keys, so Join and GroupBy operations would be meaningful.
- Data should be open and available for others who wish to run our benchmarks themselves.

We selected the AWS-hosted Google Books N-Grams dataset. We used the British English 4-gram set, which has 5,325,077,699 rows and is about 200GB in uncompressed file size (46.6 GB as compressed sequence files).

An N-gram is a set of N tokens which appeared consecutively in some written corpus. In the case of the Google Books dataset, that written corpus consisted of many thousands of books which were digitized by Google. For example, the 3-grams for the sentence "The yellow dog played fetch." are: [The, yellow, dog]; [yellow, dog, played]; [dog, played, fetch]; and [played, fetch, .] (punctuation also counts as a token and words are case-sensitive); the sliding-window sketch below reproduces this decomposition.
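The decomposition above is easy to reproduce in a few lines of Scala (purely illustrative; the benchmark dataset ships with the N-grams already computed):

object NGramSketch {
  def main(args: Array[String]): Unit = {
    val tokens = Seq("The", "yellow", "dog", "played", "fetch", ".")
    // sliding(3) yields every run of 3 consecutive tokens: exactly the 3-grams listed above
    tokens.sliding(3).foreach(println)
  }
}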

The benchmark's data schema is detailed in Table 1.

Table 1: Benchmark Schema

Field       | Type          | Description
N-Gram      | Array[String] | Array of 4 words
Year        | Int           | Year of the document
Occurrences | Int           | Number of times the N-Gram appeared in Year
Pages       | Int           | Number of unique pages containing the N-Gram in Year
Books       | Int           | Number of unique books containing the N-Gram in Year

4.3. Benchmark Data Load and Caching

The AWS N-Gram dataset is stored as a sequence file. The AWS sequence file was preprocessed into a Parquet file and read into the collections in the following order (a code sketch appears at the end of this section):

1. Read the Parquet file into a DataFrame.
2. From the DataFrame above, generate a Dataset[N-Gram].
3. From the Dataset[N-Gram] above, generate an RDD[N-Gram].

We initially read the sequence file directly into an RDD and used it to generate the other collections. We found its results to be comparable to the method above, so we decided to use the Parquet method instead, as it offered simpler collection creation.

In order to prevent misleading results, every transformation benchmark was executed on its own, running one transformation at a time, with no chaining of multiple transformations. As only one transformation was run in each run, no caching of collections was performed (i.e. no cache() or persist() was called on any collection).
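A minimal sketch of that load order (Spark 2.x, Scala; the Parquet path is a placeholder and the NGram case class follows Appendix B.1):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object LoadOrderSketch {
  case class NGram(row: Long, gram: Seq[String], year: Int,
                   occurances: Long, pages: Long, books: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("load-order").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("s3://your-bucket/ngrams.parquet") // 1. Parquet -> DataFrame
    val ds: Dataset[NGram] = df.as[NGram]                          // 2. DataFrame -> Dataset[NGram]
    val rdd: RDD[NGram] = ds.rdd                                   // 3. Dataset[NGram] -> RDD[NGram]

    spark.stop()
  }
}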

4.4. Benchmark Clusters Specifications

The benchmark was executed on both AWS and Azure HDInsight clusters of varied sizes, in order to provide more comprehensive benchmark results. The benchmark clusters configuration is detailed in Table 2.

Table 2: Benchmark Clusters Configuration

Name  | Environment     | Master Nodes                      | Worker Nodes                            | Storage Used
AWS4  | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 4 x r4.xlarge (4 Cores, 30GB RAM each)  | S3
AWS8  | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 8 x r4.xlarge (4 Cores, 30GB RAM each)  | S3
AWS16 | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 16 x r4.xlarge (4 Cores, 30GB RAM each) | S3
AWS32 | AWS             | 1 x r4.xlarge (4 Cores, 30GB RAM) | 32 x r4.xlarge (4 Cores, 30GB RAM each) | S3
HDI4  | Azure HDInsight | 2 x (4 Cores, 7GB RAM each)       | 4 x (4 Cores, 28GB RAM each)            | Local Cluster HDFS
HDI13 | Azure HDInsight | 2 x (4 Cores, 7GB RAM each)       | 13 x (4 Cores, 28GB RAM each)           | Local Cluster HDFS

5. Benchmark Results

Following is an analysis of the benchmark results. The full results, as well as the benchmark code, are detailed in the appendixes.

5.1. Basic Data Manipulation API Comparison in Spark 2.1

We compared the performance of RDD, DataFrame and Dataset in Spark version 2.1. We checked the individual performance of the Filter, Aggregate by Key and Sort transformations, as they are staple data manipulation tasks. They are also non-trivial tasks, in that they require major operations on the collections and moving data between different data nodes. We used total CPU time as the benchmark number, as it is a better representation of the actual cluster activity. We normalized all numbers by the RDD total CPU time for ease of comparison between the collections. Chart 1 below shows the benchmark results based on the HDI13 run.

[Chart 1: Basic API Comparison, Spark version 2.1 - total CPU as % of RDD for Filter, Aggregate by Key and Sort; series: RDD, DataFrame v2, Dataset v2]

As expected, DataFrame v2 outperformed the older RDD, taking less than half the time to perform the sort or aggregate transformation and less than three quarters of the time to filter.

Dataset performance provided a very surprising result. Considering that in Spark 2.1 a DataFrame is a Dataset[Row], the benchmark is actually between the Dataset Typed and Untyped transformations, which on the surface seem equivalent. However, the actual results show that the Dataset Typed transformations were significantly slower: about six times slower than the Untyped transformations used in DataFrame, and 2.5 times slower than RDD, when performing the aggregate or sort transformation.
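The two seemingly equivalent aggregation styles look like this (condensed from Appendix B.3; assumes the NGram case class, with its + method, from Appendix B.1). One plausible explanation for the gap, which this benchmark did not profile, is that Typed transformations run opaque JVM lambdas that force each element to be deserialized, while Untyped Column expressions remain visible to the Catalyst optimizer:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{max, sum}

object AggregationStyles {
  // Untyped (DataFrame-style): column expressions Catalyst can optimize
  def untypedAgg(ds: Dataset[NGram]): DataFrame =
    ds.groupBy("gram").agg(max("year").as("year"), sum("occurances").as("occurances"))

  // Typed (RDD-style): an opaque lambda over deserialized NGram objects
  def typedAgg(spark: SparkSession, ds: Dataset[NGram]): Dataset[NGram] = {
    import spark.implicits._
    ds.groupByKey(_.gram).mapGroups { case (_, rows) => rows.reduce(_ + _) }
  }
}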

5.2. Basic Data Manipulation API Spark 1.6 vs. Spark 2.1

We compared the performance of the same basic API in Spark 1.6. We only tested RDD and DataFrame, as Datasets are not recommended for use prior to Spark version 2.x. Due to a data capture limitation in Spark 1.6 we could only collect elapsed time for our operations, so elapsed time was used for the benchmark.

We compared the operations of RDD and DataFrame in Spark version 1.6. We normalized all numbers by the RDD elapsed time for ease of comparison. Chart 2 below shows the benchmark results based on the HDI13 run.

[Chart 2: Basic API Comparison, Spark version 1.6 - elapsed time as % of RDD for Filter, Aggregate by Key and Sort; series: RDD, DataFrame v1]

We also compared the operations of RDD and DataFrame in Spark version 1.6 vs. the same API in Spark version 2.1. We normalized all numbers, showing Spark 2.1 elapsed time as a percentage of the Spark 1.6 time for ease of comparison. Chart 3 below shows the benchmark results based on the HDI13 and HDI4 runs.

[Chart 3: Basic API Comparison, Spark version 2.1 elapsed time as % of Spark version 1.6, for Filter, Aggregate by Key, Sort and Sort (HDI4); series: RDD, DataFrame v2 vs. v1]

We see a significant improvement in filter time between Spark 2.1 and Spark 1.6. DataFrame and RDD show only slight improvements for the aggregate between the Spark versions. RDD sort is slightly better, while the result for DataFrame sort is mixed: DataFrame sort on HDI13 was 15% slower in Spark version 2.1, while on HDI4 it was slightly faster than Spark version 1.6 on HDI4. This may be attributed to the fact that we had to use elapsed time for our comparison, and the difference in sort elapsed times for HDI13 is less than 22 seconds.

5.3. Join Comparison

When we tried to perform any join operation on our large RDD, the Spark operation terminated with an error. Even joins to a ~300-record table could not complete. This is a clear indication that RDDs are not suitable for join operations.

We proceeded to test a join to a small table (~300 rows) in DataFrame and Dataset. We also simulated a join using a Broadcast variable and compared it to the small join time. We used total CPU time as the benchmark number. Chart 4 below shows the benchmark results based on the HDI13 run.

[Chart 4: Join Comparison, Spark version 2.1 - total CPU (minutes) for Small Join vs. Broadcast; series: DataFrame v2, Dataset v2]

For joins, it was DataFrame performance that provided a very surprising result. As described above, the benchmark is actually between the Dataset Typed and Untyped transformations, which on the surface seem equivalent. However, the actual results show that the Dataset Typed transformation was significantly faster: about 4.5 times faster than the Untyped transformation used in DataFrame.

Although DataFrame seems to benefit from replacing a small join with Broadcast, the Dataset small join operation in Spark version 2.1 appears to be optimized and equivalent in performance to a Broadcast operation.

DataFrame small join performance in Spark version 2.1 seemed oddly slow, so we compared it with DataFrame small join in Spark version 1.6. To our surprise, DataFrame in Spark 1.6 completed the small join operation in half the time of DataFrame in Spark version 2.1, but was still almost three times slower than the equivalent Typed join transformation of Dataset.

DataFrame v1 (Spark 1.x) and DataFrame v2 (Spark 2.x) are very different under the hood. This is an important point to consider when upgrading your application: you may need to rewrite your code to regain performance that was lost through the upgrade.
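As a side note on the Broadcast comparison above: besides simulating the join with a Broadcast variable as in Appendix B.6, Spark 2.x also exposes an explicit broadcast hint on the DataFrame API. A minimal sketch (bigDF and smallDF are placeholders; this variant was not part of the benchmark runs):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  // The hint asks the planner to ship smallDF to every executor and
  // run a broadcast hash join instead of shuffling the big side.
  def smallJoin(bigDF: DataFrame, smallDF: DataFrame): DataFrame =
    bigDF.join(broadcast(smallDF), "year")
}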

5.4. RDD GroupBy vs. ReduceByKey

We tested the Spark documented recommendation of using RDD ReduceByKey instead of GroupBy followed by Reduce. The ReduceByKey completed in less than 15 minutes of elapsed time. The GroupBy did not complete after 7.5 hours, with executors restarting multiple times due to internal errors. We confirmed that RDD GroupBy should be avoided at all costs.

We recognize that there are situations, especially in Data Science calculations, in which a GroupBy operation cannot be replaced by ReduceByKey. An example is a window operation such as calculating the average annual growth rate of an N-Gram's usage, which is calculated for each N-Gram, with its years in ascending order, by the following formula:

    (1/n) * Σ_i (occurrences_{i+1} - occurrences_i) / (year_{i+1} - year_i)

We were able to calculate the average annual growth on our large dataset on the HDI13 cluster using both the Dataset Typed GroupBy (with iterators over the groups) and DataFrame SQL window functions. The Dataset GroupBy completed in 48 minutes and the DataFrame SQL window functions (lag) completed in 26 minutes; a sketch of the window-function approach follows below.

For cases in which ReduceByKey cannot replace GroupBy, your best alternative is to use DataFrame SQL window functions, and if that is not possible, use the Dataset Typed GroupBy.
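A sketch of the window-function variant (Spark 2.x, Scala; column names follow Table 1 rather than the benchmark's exact code, and the per-step rate implements the formula above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, lag}

object GrowthRateSketch {
  // For each N-Gram, lag() pairs every year with the previous one,
  // giving the per-step growth, which is then averaged per N-Gram.
  def avgAnnualGrowth(df: DataFrame): DataFrame = {
    val byGram = Window.partitionBy("gram").orderBy("year")
    df.withColumn("prevOccurrences", lag(col("occurrences"), 1).over(byGram))
      .withColumn("prevYear", lag(col("year"), 1).over(byGram))
      .withColumn("rate",
        (col("occurrences") - col("prevOccurrences")) / (col("year") - col("prevYear")))
      .groupBy("gram")
      .agg(avg("rate").as("avgAnnualGrowth")) // first year of each N-Gram yields null and is ignored by avg
  }
}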

5.5. Increasing Cluster Size

Increasing cluster resources by adding more worker nodes should result in improved performance. The additional nodes add to your total cost of operations, especially in a pay-per-use environment such as AWS or Azure HDInsight. We ran the Filter API on a set of AWS clusters, doubling the worker nodes after each run. This allowed us to assess the cost-benefit ratio of the increased cluster size.

The relevant measure of cluster effectiveness is the elapsed time, as it represents the total time the cluster spent producing the result. In a pay-per-use environment your cost is also directly related to the total elapsed time the cluster was used.

First we looked at the percent marginal improvement in elapsed time as we doubled the worker nodes from 4 up to 32. Chart 5 below shows the result.

[Chart 5: Percent elapsed-time improvement when doubling the worker nodes (4 to 8, 8 to 16, 16 to 32); series: RDD, DataFrame v2, DataSet v2]

We could clearly see that the rate of improvement was declining. We then compared the actual elapsed time improvements in minutes. To assess the cost-effectiveness of the improvement, we divided the extra cluster cost by the total marginal minutes improved, to get the cost per one minute of marginal improvement; a worked example of this calculation follows below. Charts 6 and 7 below show the results.
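As a purely hypothetical illustration of this metric (neither the prices nor the timings below are the benchmark's actual figures): suppose doubling from 8 to 16 workers adds 8 extra nodes at $0.27 per node-hour, the larger cluster finishes the run in 30 minutes, and the run completes 6 minutes sooner than on 8 workers. The extra cluster cost is 8 x $0.27/hour x 0.5 hours = $1.08, so the marginal cost is $1.08 / 6 = $0.18 per minute of improvement.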

[Chart 6: Total elapsed-time improvement in minutes when doubling the worker nodes (4 to 8, 8 to 16, 16 to 32); series: RDD, DataFrame v2, DataSet v2]

[Chart 7: Cost per minute of elapsed-time improvement when doubling the worker nodes (4 to 8, 8 to 16, 16 to 32); series: RDD, DataFrame v2, DataSet v2]

The cost-benefit comparison of cost per marginal minute of improvement shows an exponentially declining benefit to increasing the cluster size and resources. The optimal cost-benefit point in our situation seems to lie between 8 and 16 worker nodes. There appears to be very little benefit in growing the cluster from 16 worker nodes to 32 worker nodes.

6. About SWI

SWI offers deep business expertise and a flawless track record of delivery for high-impact technology projects in the Energy and Financial sectors. SWI has over 35 years of experience offering dependable technology consulting services. Since SWI was founded in 1978, we have provided solutions for nuclear, financial, energy and transportation organizations: industries where dependability and business continuity are critical. Over the last three decades, SWI has developed a distinguished reputation based on a flawless track record of solutions, long-term partnerships with our clients and deep domain expertise in the energy and financial sectors.

SWI's history provides a solid foundation for growth, building on the company's notable achievements, including:

- Long-term relationships with clients that have grown through trust and a shared commitment to excellence
- A history of delivering high-quality solutions on schedule and on budget
- Participation in professional associations and standards organizations as part of our commitment to continuous improvement
- A reputation for best-in-class solutions

We feel strongly that by building long-term partnerships with our clients, based on trust and an ongoing commitment to excellence, we can accomplish great things.

WHAT WE DO
SWI provides a full range of consulting services, including custom software development, systems integration, project management, strategic planning, business analysis, and quality assurance solutions. We strive to provide our clients with the highest quality solutions to gain the best results for their business.

WHO WE ARE
Located in midtown Toronto, SWI is an IT professional services firm with a 35-year track record of delivering high-value solutions to enterprise customers in the Energy and Financial sectors. SWI has a reputation for maintaining deep industry knowledge and extensive experience in architecting and implementing complex IT systems.

Appendix A - Benchmark Detailed Results

General notes on all results:
- All times are in minutes.
- Spark version 1.6 does not provide total CPU time or Garbage Collection (GC) time.
- Dataset was only tested in Spark version 2.1.

A.1. Result for Filter benchmark

[Table: Filter benchmark results per cluster (AWS4, AWS8, AWS16, AWS32, HDI4, HDI13), for Spark v2.1 on all clusters and v1.6 on the HDI clusters, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. Under Spark v1.6 on HDI13, elapsed time was 9.64 for RDD and 7.39 for DataFrame.]

A.2. Result for Aggregate By Key benchmark

Note: AWS4 had execution errors during the calculation and is omitted from the result list.

[Table: Aggregate By Key benchmark results per cluster (AWS8, AWS16, AWS32, HDI4, HDI13) and Spark version, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. Under Spark v1.6, RDD elapsed time on HDI13 was 6.72 and DataFrame elapsed time on HDI4 was 5.99.]

A.3. Result for Sort benchmark

Note: AWS4 had execution errors during the calculation and is omitted from the result list.

[Table: Sort benchmark results per cluster (AWS8, AWS16, AWS32, HDI4, HDI13) and Spark version, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. Under Spark v1.6, RDD and DataFrame elapsed times on HDI13 were 6.86 and 2.60 respectively, and DataFrame elapsed time on HDI4 was 5.97.]

A.4. Result for Join benchmark

All Join tests were run on the HDI13 cluster. The Join benchmark tested three types of joins:
- Small: join of our big N-Gram collection with a 300-row reference table.
- Broadcast: using a Broadcast of the 300 rows instead of a Join.
- Medium: join of our big N-Gram collection with a medium (1GB) table.

Notes:
- Error indicates the Join operation failed to complete and exited with an error.
- Since the small join failed for RDD, we did not run Broadcast on it.
- Dataset was not run for Spark version 1.6.

[Table: Join benchmark results per join type (Small, Broadcast, Medium) and Spark version, with Total CPU, GC and Elapsed columns for each of RDD, DataFrame and Dataset. RDD exited with errors on the small join in both v2.1 and v1.6, and the medium join exited with errors in v2.1 for RDD and DataFrame.]

Appendix B - Benchmark Code

B.1. RDD / Dataset object

import scala.math.max

case class NGram(row: Long, gram: Seq[String], year: Int,
                 occurances: Long, pages: Long, books: Long) {
  def +(other: NGram): NGram =
    NGram(max(row, other.row), gram, max(year, other.year),
          occurances + other.occurances, pages + other.pages, books + other.books)
}

B.2. Filter Code

RDD[N-Gram]:
  rdd.filter(_.gram.forall(containsLowercaseLettersOnly))

DataFrame:
  df.filter(concat_ws("", df.col("gram")).rlike("^[a-z]++$"))

Dataset[N-Gram]:
  ds.filter(_.gram.forall(containsLowercaseLettersOnly))

B.3. Aggregate By Key Code

RDD[N-Gram]:
  rdd.map(row => row.gram -> row).reduceByKey(_ + _)

DataFrame:
  df.groupBy("gram").agg(max("year").as("year"), sum("occurances").as("occurances"))

Dataset[N-Gram]:
  ds.groupByKey(_.gram).mapGroups { case (_, rows) => rows.reduce(_ + _) }

B.4. Sort Code

RDD[N-Gram]:
  rdd.sortBy(_.occurances, ascending = false)

DataFrame:
  df.sort(df.col("occurances").desc)

Dataset[N-Gram]:
  ds.sort(ds.col("occurances").desc)

B.5. Join Code

RDD[N-Gram]:
  val keyedData = rdd.map(row => row.year -> row)
  val keyedReference = refRDD.map(row => row.year -> row)
  keyedData.join(keyedReference).map { case (_, (v1, v2)) => (Some(v1), Some(v2)) }

DataFrame:
  bigDF.join(smallDF, "year")

Dataset[N-Gram]:
  bigDS.joinWith(smallDS, bigDS.col("year") === smallDS.col("year"), "inner")

B.6. Broadcast Code

RDD[N-Gram]:
  rdd.map(nGram => (nGram, reference.value.get(nGram.year)))

DataFrame:
  df.withColumn("yearInWords", bc.value.get(df.col("year")))

Dataset[N-Gram]:
  ds.map(nGram => (nGram, bc.value.get(nGram.year)))
