Processing of big data with Apache Spark

1 Processing of big data with Apache Spark
JavaSkop 18
Aleksandar Donevski

2 AGENDA
- What is Apache Spark?
- Spark vs Hadoop MapReduce
- Application Requirements
- Example Architecture
- Application Challenges

3 WHAT IS APACHE SPARK?
- Engine for large-scale data processing
- Open source
- APIs for Java, Scala, Python, and R
- Runs standalone or on YARN, Kubernetes, and Mesos
- Accesses HDFS, HBase, Cassandra, S3, etc.

4 SPARK ARCHITECTURE

5 RESILIENT DISTRIBUTED DATASET (RDD)
Characteristics: immutable, distributed, partitioned, and resilient
API:
1. Transformations: map(), filter(), distinct(), union(), subtract(), etc.
2. Actions: reduce(), collect(), count(), first(), take(), etc.

6 RDD OPERATIONS
- Transformations are executed on the workers
- Actions may transfer data from the workers to the driver
- collect() sends all partitions to the single driver
- Persistence: persist() and cache()
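
The transformations and actions listed above can be sketched with the Java RDD API. This is a minimal illustration, not code from the talk: it assumes spark-core is on the classpath, runs in local mode, and the class name, method name, and sample numbers are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

final class RddBasics {

    /** Sums the squares of the distinct even numbers in the input. */
    static int sumOfDistinctEvenSquares(JavaSparkContext sc, List<Integer> numbers) {
        JavaRDD<Integer> squares = sc.parallelize(numbers, 4) // split into 4 partitions
                .filter(n -> n % 2 == 0)   // transformation: nothing runs yet
                .distinct()                // transformation: needs a shuffle
                .map(n -> n * n);          // transformation: still nothing runs

        squares.cache();                   // keep computed partitions in memory for reuse

        long howMany = squares.count();          // action: triggers the computation
        List<Integer> sample = squares.take(2);  // action: ships at most 2 elements to the driver
        System.out.println(howMany + " values, sample: " + sample);

        return squares.reduce(Integer::sum);     // action: can reuse the cached partitions
    }

    public static void main(String[] args) {
        // "local[2]" is an assumption for trying this out without a cluster.
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(sumOfDistinctEvenSquares(sc, Arrays.asList(1, 2, 2, 3, 4, 6)));
        }
    }
}
```

take(2) is used instead of collect() on purpose: collect() would bring every partition to the driver, which is only safe when the result is known to be small.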

8 SPARK VS MAPREDUCE
Spark:
- Real-time, streaming
- Processes data in memory
- Handles structures that cannot be decomposed into key-value pairs
MapReduce:
- Batch mode, not real-time
- Persists to disk after the map operation
- Key-value pairs

10 APPLICATION REQUIREMENTS
- Verify that the application is thread-safe
- Use synchronized blocks appropriately
- Avoid duplicating objects
- Prefer arrays of objects and primitive types
- Avoid unneeded data in objects
- Always remember that the application is executed in parallel!
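
A small plain-Java sketch of two of the points above: shared state guarded by a narrow synchronized block, and counters kept in a primitive array rather than boxed objects. It matters in Spark because an executor runs several tasks as threads inside one JVM. TokenStats and its methods are hypothetical names, not part of any Spark API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical statistics holder shared by tasks running as threads in one JVM. */
final class TokenStats {
    // Primitive array: one contiguous block instead of one boxed object per counter.
    private final long[] countsByLength = new long[32];
    private final Object lock = new Object();

    void record(String token) {
        int bucket = Math.min(token.length(), countsByLength.length - 1);
        // Keep the synchronized block small: only the shared write is guarded.
        synchronized (lock) {
            countsByLength[bucket]++;
        }
    }

    long total() {
        synchronized (lock) {
            long sum = 0;
            for (long count : countsByLength) {
                sum += count;
            }
            return sum;
        }
    }

    /** Records the same tokens from several threads and returns the total count. */
    static long recordInParallel(String[] tokens, int threads) {
        TokenStats stats = new TokenStats();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (String token : tokens) {
                    stats.record(token);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("threads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stats.total();
    }
}
```

Without the synchronized block, concurrent increments of the same array slot could be lost and the total would come out too low.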

11 APPLICATION PIPELINE
- Define the application pipeline using the SparkContext object
- Encapsulate the common data needed by all pipeline steps
- Prepare the common data and broadcast it to the workers as needed
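
Broadcasting common data might look like the sketch below. It assumes spark-core on the classpath and local mode; the lookup table, the class name, and the country codes are invented for illustration and are not from the talk.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

final class BroadcastPipeline {

    /** Replaces codes by names using a lookup table broadcast to the workers. */
    static List<String> expandCodes(JavaSparkContext sc, List<String> codes,
                                    Map<String, String> names) {
        // Broadcast once; tasks read the shared copy instead of carrying the map themselves.
        Broadcast<Map<String, String>> lookup = sc.broadcast(new HashMap<>(names));

        return sc.parallelize(codes)
                .map(code -> lookup.value().getOrDefault(code, "unknown"))
                .collect(); // acceptable here only because the result is tiny
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("broadcast-demo").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            Map<String, String> names = new HashMap<>();
            names.put("MK", "North Macedonia");
            names.put("DE", "Germany");
            System.out.println(expandCodes(sc, Arrays.asList("MK", "DE", "XX"), names));
        }
    }
}
```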

13 [Diagram: example architecture]
- Driver side: List of Files, Balance, Process Metrics
- Worker 1 to Worker N (each): Load Content, Process & Enhance, Persist Content
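
The slide does not say how the Balance step works. One plausible reading is that the driver splits the list of files into groups of similar total size before handing them to the workers. The plain-Java sketch below shows that idea with a greedy largest-first assignment; FileBalancer, FileInfo, and the algorithm choice are all assumptions, not the speaker's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FileBalancer {

    /** Hypothetical input record: a file path and its size in bytes. */
    static final class FileInfo {
        final String path;
        final long sizeBytes;

        FileInfo(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Splits the files into the given number of groups with roughly equal total size. */
    static List<List<FileInfo>> balance(List<FileInfo> files, int groups) {
        List<List<FileInfo>> result = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            result.add(new ArrayList<>());
        }
        long[] load = new long[groups];

        // Largest files first, each one going to the currently lightest group.
        List<FileInfo> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((FileInfo f) -> f.sizeBytes).reversed());
        for (FileInfo file : sorted) {
            int lightest = 0;
            for (int i = 1; i < groups; i++) {
                if (load[i] < load[lightest]) {
                    lightest = i;
                }
            }
            result.get(lightest).add(file);
            load[lightest] += file.sizeBytes;
        }
        return result;
    }

    static long totalSize(List<FileInfo> group) {
        long sum = 0;
        for (FileInfo file : group) {
            sum += file.sizeBytes;
        }
        return sum;
    }
}
```

Each resulting group could then become one partition, so that no worker ends up with a disproportionate share of the input.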

15 DEPENDENCY ISSUES
Example:
- Spark uses Log4j 1.x; the application uses Log4j 2.x (Async Loggers)
- The application fails at the very beginning
Resolution:
- Shade the dependencies
- Provide the dependencies via the --jars option and set spark.{driver,executor}.userClassPathFirst=true

16 MEMORY ISSUES
Example:
- Default of 1 GB RAM per executor
- Executors fail with an OOM error, so the application fails
Resolution:
- Verify the memory available in the cluster
- Monitor and measure memory usage
- Tune per application
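
The 1 GB default corresponds to the spark.executor.memory setting. A configuration sketch for raising it is shown below, assuming spark-core on the classpath. The values 4g and 2 are placeholders to be replaced by whatever monitoring suggests, and MemorySettings is a hypothetical helper name.

```java
import org.apache.spark.SparkConf;

final class MemorySettings {

    /** Returns the given configuration with explicit executor sizing. */
    static SparkConf withExecutorSizing(SparkConf base) {
        return base
                .set("spark.executor.memory", "4g") // heap per executor; Spark's default is 1g
                .set("spark.executor.cores", "2");  // tasks running concurrently share that heap
    }
}
```

These settings have to be in place before the SparkContext is created. The same keys can also be passed on the command line when submitting the job.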

17 PERFORMANCE ISSUES
Example:
- Application execution takes too long for a simple data set
- The last task takes too long to execute
Resolution:
- Verify the partitioning
- Adjust the processing time of each task
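
One common reason for a slow last task is a partition that holds far more records than the others. The plain-Java sketch below does not use Spark. It applies the same rule as Spark's HashPartitioner (non-negative hashCode modulo the number of partitions) to show how a frequent key piles up in one partition. PartitionSkew and the sample keys are made up for the example.

```java
import java.util.List;

final class PartitionSkew {

    /** Partition index for a key: non-negative hashCode modulo the partition count. */
    static int partitionOf(Object key, int numPartitions) {
        return key == null ? 0 : Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Counts how many records each partition would receive for the given keys. */
    static long[] recordsPerPartition(List<String> keys, int numPartitions) {
        long[] counts = new long[numPartitions];
        for (String key : keys) {
            counts[partitionOf(key, numPartitions)]++;
        }
        return counts;
    }
}
```

If one count dominates, the task for that partition finishes long after the rest. Typical remedies are a more evenly distributed key or a different number of partitions, for example through repartition() on the RDD.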

18 APPLICATION ISSUES
Example:
- Default Java serialization is used
- Serialization takes too long
Resolution:
- Verify the object data and data structures used
- Use Kryo serialization
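
Switching to Kryo is a configuration change. The sketch below assumes spark-core on the classpath. ContentRecord is a hypothetical stand-in for the application's own classes, and KryoSettings is an invented helper name.

```java
import org.apache.spark.SparkConf;

final class KryoSettings {

    /** Hypothetical application record; stands in for the job's own data classes. */
    static final class ContentRecord {
        String id;
        byte[] body;
    }

    /** Returns the given configuration with Kryo enabled and the record class registered. */
    static SparkConf withKryo(SparkConf base) {
        return base
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // Registered classes are written with a short id instead of the full class name.
                .registerKryoClasses(new Class<?>[] {ContentRecord.class});
    }
}
```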

19 API ISSUES
- Release cycle of three to four months
- Lots of under-the-hood changes
- Verify whether the application is affected
