Outline. CS-562 Introduction to data analysis using Apache Spark

Size: px

Start display at page:

Download "Outline. CS-562 Introduction to data analysis using Apache Spark"

Sharlene Potter
5 years ago
Views:

1 Outline Data flow vs. traditional network programming What is Apache Spark? Core things of Apache Spark RDD CS-562 Introduction to data analysis using Apache Spark Instructor: Vassilis Christophides T.A.: Kostas Solomos / Moustakas Serafeim Core Functionality of Apache Spark Simple tutorial

2 Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How to split problem across nodes? Must consider network & data locality Data flow vs. traditional network programming» How to deal with failures? (inevitable at scale)»even worse: stragglers (node not failed, but slow)»ethernet networking not fast» Have to write programs for each machine

3 Data Flow Models Why Use a Data Flow Engine? Restrict the programming interface so that the system can do more automatically Ease of programming» High-level functions instead of message passing Express jobs as graphs of high-level operators»system picks how to split each operator into tasks and where to run each task»run parts twice fault recovery Wide deployment» More common than MPI, especially near data Scalability to very largest clusters»even HPC world is now concerned about resilience Biggest example: MapReduce Examples: Pig, Hive, Scalding, Storm

4 Limitations of MapReduce MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms No efficient primitives for data sharing Limitations of MapReduce»State between steps goes to distributed file system»slow due to replication & disk storage

5 Result While MapReduce is simple, it can require asymptotically more communication or I/O

6 What is Apache Spark? Programming Framework offering abstraction and parallelism Motivation for spark Software Engineering Hadoop code base is huge Contributions/Extensions are difficult Java only System/Framework One of the most efficient programming frameworks for clusters Unified pipeline Simplified data flow Faster processing speed Data abstraction It hides complexities of: Fault Tolerance Slow machines Network Failures New fundamental data abstraction that is easy to extend with new operators allows for a more descriptive computing model

8 Spark basic features Fast Run machine learning iterative programs...up to 100x faster that Hadoop in memory...or 10x faster on disk Avoid materializing data on HDFS after each iteration Easy to use Fluent Scala/Java/Python/R API Interactive shell (repl) 2-5x less code (than Hadoop MapReduce) But most important Provides high-level APIs and tools for many different languages Spark speaks your language

9 Spark is sophisticated Can run today s most advanced algorithms

Transformations apply user code to distributed data in

10 Spark: Descriptive Computing Model Organize computation into multiple stages in processing pipeline Transformations apply user code to distributed data in parallel Actions assemble final output of an algorithm from distributed data

11 Resilient Distributed Datasets (RDDs) The main abstraction of Spark Immutable once they are created RDD: Partitions RDDs are automatically distributed across the network by means of partitions - A partition is a logical division of data - RDD data is just a collection of partitions - Spark automatically decides the number of partitions when They carry their lineage for fault tolerance Recipe for reconstruction Think of a big collection (i.e. list) that is partitioned and spread across a cluster Diverse set of operators that offers huge abstraction creating an RDD - All input, intermediate and output data will be presented as partitions - Partitions are basic units of parallelism - A task is launched per each partition

Transformations are executed when an action runs

12 RDD: Partitions Operations on RDDs Two types: transformations and actions Transformations are lazy Transformations are executed when an action runs Operator cache persists distributed data in memory or disk

lazily evaluated which allow for optimizations to take place before execution The lineage keeps track of all transformations that have to be applied when an action

13 RDD Cache - rdd.cache() RDD operations - Transformations As in relational algebra, the application of a transformation to an RDD yields a new RDD (immutability) Transformations are lazily evaluated which allow for optimizations to take place before execution The lineage keeps track of all transformations that have to be applied when an action happens If we need the results of an RDD many times, it is best to cache it RDD partitions are loaded into the memory of the nodes that hold it avoids re-computation of the entire lineage in case of node failure compute the lineage again

14 Useful Transformations on RDDs Some more useful transformations on RDDs

15 RDD Common Transformations RDD operations - Actions Unary RDD Result rdd. map(x => x * x) {1, 2, 3, 3} {1, 4, 9, 9} rdd.flatmap(line => line.split(" ")) {"hello world", "hi"} {"hello, world", "hi"} rdd.filter(x => x!= 1) {1, 2, 3, 3} {2, 3, 3} rdd.distinct () {1, 2, 3, 3} {1, 2, 3} Binary RDD Other Result rdd.union (other) {1, 2, 3} {3,4,5} {1,2,3,3,4,5} rdd.intersection(other) {1, 2, 3} {3,4,5} {3} rdd.subtract(other) {1, 2, 3} {3,4,5} {1, 2} rdd.cartesian(other) {1, 2, 3} {3,4,5} {(1,3),(1,4), (3,5)} Apply transformation chains on RDDs, eventually performing some additional operations (e.g. counting) i.e. trigger job execution Used to materialize computation results Some actions only store data from the RDD upon which the action is applied and convey it to the driver

16 Some Useful Actions on RDDs RDD Actions reduce(): Takes a function that operates on two elements of the type in your RDD and returns a new element of the same type. The function is applied on all elements. collect(): returns the entire RDD s contents (commonly used in unit tests where the entire contents of the RDD are expected to fit in memory). The restriction here is that all of your data must fit on a single machine, as it all needs to be copied to the driver. take(): returns n elements from the RDD and tries to minimize the number of partitions it accesses. No expected order count(): returns the number of elements 32

17 Hadoop: Bloated Computing Model RDD Actions RDD Result rdd.reduce((x, y) => x + y) {1,2,3} 6 Example RDD Result rdd.collect() {1,2,3} {1,2,3} rdd.take(2) {1,2,3,4} {1,3} rdd.count() {1,2,3,3}

Spark: Your new friend Install Spark Windows: https://dzone.

18 Spark: Your new friend Install Spark Windows: Linux: Follow tutorial of exercise

19 Special Thanks CS543 Presentations Stanford University, Introduction to Apache Spark Resilient Distributed Datasets,Henggang Cui Coursera Introduction to Apache Spark,University of California,Databricks

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How