

COSC 6339 Big Data Analytics
Introduction to Spark
Edgar Gabriel, Fall 2018

What is SPARK?

- In-memory cluster computing for big data applications
- Fixes the weaknesses of MapReduce
  - Iterative applications
  - Streaming data processing
  - Keeps data in memory across different functions
- Spark works across many environments
  - Standalone, Hadoop, Mesos, ...
- Spark supports accessing data from diverse sources (HDFS, HBase, Cassandra, ...)

What is SPARK (II)

- Three modes of execution
  - Spark shell
  - Spark scripts
  - Spark code
- API defined for multiple languages
  - Scala
  - Python
  - Java
  - R

A couple of words on Scala

- Object-oriented language: everything is an object and every operation is a method call
- Scala is also a functional language
  - Functions are first-class values
  - Functions can be passed as arguments to functions
  - Functions should be free of side effects
  - Functions can be defined inside other functions
- Scala runs on the JVM
  - Java and Scala classes can be freely mixed

Spark Essentials

- A Spark program has to create a SparkContext object, which tells Spark how to access a cluster
- This is done automatically in the shell for Scala or Python: the context is accessible through the sc variable
- Programs must use a constructor to instantiate a new SparkContext

    gabriel@whale:> pyspark
    Using Python version (default, Nov :55:38)
    SparkSession available as 'spark'.
    >>> sc
    <pyspark.context.SparkContext object at 0x2609ed0>

Spark Essentials (II)

- The master parameter for a SparkContext determines which resources to use, e.g.

    whale> pyspark --master yarn

SPARK cluster utilization

1. The driver program (through its SparkContext) connects to a cluster manager, which allocates resources across applications
2. It acquires executors on cluster nodes: processes that run compute tasks and cache data
3. It sends the application code to the executors
4. It sends tasks for the executors to run

SPARK master parameter

Master URL          Meaning
local               Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K]            Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
spark://host:port   Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://host:port   Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://...
yarn                Connect to a YARN cluster in client or cluster mode, depending on the value of --deploy-mode. The cluster location will be found based on HADOOP_CONF_DIR.
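The master URL forms in the table can be told apart mechanically. The following is a minimal plain-Python sketch, not part of Spark: the function name describe_master and its return format are hypothetical, chosen only to illustrate how each form in the table might be interpreted.

```python
import re


def describe_master(url):
    """Classify a master URL following the table above (illustrative helper, not a Spark API)."""
    if url == "local":
        return ("local", 1)                      # one worker thread, no parallelism
    match = re.fullmatch(r"local\[(\d+)\]", url)
    if match:
        return ("local", int(match.group(1)))    # K worker threads
    if url.startswith("mesos://zk://"):
        return ("mesos-zk", url[len("mesos://"):])   # Mesos located through ZooKeeper
    match = re.fullmatch(r"(spark|mesos)://([^:/]+):(\d+)", url)
    if match:
        scheme, host, port = match.groups()
        return (scheme, (host, int(port)))       # standalone or Mesos master
    if url == "yarn":
        return ("yarn", None)                    # location comes from HADOOP_CONF_DIR
    raise ValueError("unrecognized master URL: %r" % url)


local_threads = describe_master("local[4]")
standalone = describe_master("spark://whale:7077")
```

The host name whale is borrowed from the shell prompts in the slides; the port is the standalone default named in the table.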

Programming Model

- Resilient distributed datasets (RDDs)
  - Immutable collections partitioned across cluster nodes that can be rebuilt if a partition is lost
  - Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...)
- Two types of RDDs defined today:
  - Parallelized collections: take an existing collection and run functions on it in parallel
  - Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop

Programming Model (II)

- Two types of operations on RDDs: transformations and actions
- Transformations are lazy (not computed immediately)
  - Instead, they remember the transformations applied to some base dataset, which lets Spark
    - optimize the required calculations
    - recover from lost data partitions
  - By default, the transformed RDD gets recomputed each time an action is run on it
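The difference between lazy transformations and actions can be modelled without a cluster. The sketch below is a toy stand-in written in plain Python, not Spark code: the class LazyDataset and the helper square are hypothetical names, and the only behaviour it tries to mirror is that transformations are merely recorded while each action replays them.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, computes only on an action."""

    def __init__(self, base, lineage=()):
        self._base = list(base)          # plays the role of data in stable storage
        self._lineage = tuple(lineage)   # remembered transformations, not yet applied

    # Transformations: return a new dataset and compute nothing.
    def map(self, func):
        return LazyDataset(self._base, self._lineage + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._base, self._lineage + (("filter", predicate),))

    # Actions: replay the whole lineage against the base data.
    def collect(self):
        data = self._base
        for kind, func in self._lineage:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return list(data)

    def count(self):
        return len(self.collect())


calls = []  # records every invocation of the mapped function


def square(x):
    calls.append(x)
    return x * x


squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # nothing has run yet
result = squares.collect()         # first action: lineage is replayed
total = squares.count()            # second action: lineage is replayed again
```

After both actions, square has been called once per element per action, which is the default recomputation behaviour the slide describes.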

Programming Model (III)

- Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g. the local file system, Amazon S3, Hypertable, HBase, etc.
- PySpark initially supported only text files (later releases added SequenceFiles and other Hadoop InputFormats); Scala Spark supports SequenceFiles and any other Hadoop InputFormat

Creating a simple RDD

    >>> numbers = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> numbers.collect()
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> numbers.count()
    10
    >>> numbers2 = sc.parallelize(range(0, ))
    >>> numbers2.getNumPartitions()
    4
    >>> quit()

- collect(): returns a list that contains all elements in this RDD. Note: it should only be used if the resulting list is expected to be small.
- count(): returns the number of elements in this RDD.
- getNumPartitions(): returns the number of partitions in this RDD.
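To make the notion of partitions concrete, here is a minimal plain-Python sketch of one plausible way to cut a collection into contiguous slices. The function split_into_partitions is hypothetical, and the even split it performs is an assumption for illustration; Spark's own slicing may differ.

```python
def split_into_partitions(data, num_partitions):
    """Cut a collection into num_partitions contiguous slices of near-equal size."""
    items = list(data)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


numbers = list(range(10))
parts = split_into_partitions(numbers, 4)

# Plain-Python analogues of the actions shown above.
collected = [x for part in parts for x in part]   # like collect()
counted = sum(len(part) for part in parts)        # like count()
num_parts = len(parts)                            # like getNumPartitions()
```

In this model, collect() concatenates the slices and count() sums their lengths, which is why collect() is only advisable when the combined result is small.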

Creating an RDD from an Input File

    hdfs dfs -cat /gabriel/simple-input.txt
    line1 value1
    line2 value2
    line3 value3
    line4 value4

    >>> text = sc.textFile("/gabriel/simple-input.txt")
    >>> text.collect()
    [u'line1\tvalue1', u'line2\tvalue2', u'line3\tvalue3', u'line4\tvalue4']
    >>> text.count()
    4

lambda: anonymous functions in Python. Map every instance w to a pair (w, 1):

    >>> lines = text.map(lambda w: (w, 1))
    >>> lines.collect()
    [(u'line1\tvalue1', 1), (u'line2\tvalue2', 1), (u'line3\tvalue3', 1), (u'line4\tvalue4', 1)]

flatMap: map every instance l to the sequence of objects returned by the split() operation, and flatten the resulting sequences:

    >>> words = text.flatMap(lambda l: l.split())
    >>> words.collect()
    [u'line1', u'value1', u'line2', u'value2', u'line3', u'value3', u'line4', u'value4']
    >>> wcount = words.map(lambda w: (w, 1))
    >>> wcount.collect()
    [(u'line1', 1), (u'value1', 1), (u'line2', 1), (u'value2', 1), (u'line3', 1), (u'value3', 1), (u'line4', 1), (u'value4', 1)]

Shorter version:

    >>> wcounts = text.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
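The contrast between map and flatMap can be reproduced on an ordinary Python list. This is a plain-Python analogy rather than Spark code; the variable names are illustrative, and the input mirrors the four tab-separated lines of the sample file.

```python
from itertools import chain

lines = ["line1\tvalue1", "line2\tvalue2", "line3\tvalue3", "line4\tvalue4"]

# map-like: exactly one output element per input element (here, a list per line)
mapped = [line.split() for line in lines]

# flatMap-like: each input yields a sequence, and the sequences are flattened
flat = list(chain.from_iterable(line.split() for line in lines))

# the (word, 1) pairs that a word count starts from
pairs = [(word, 1) for word in flat]
```

Applying split() with map would leave four nested lists, whereas the flatMap-style version produces the flat list of eight words that the word count needs.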

Transformations

[Two slides titled "Transformations"; slide content not preserved]

Actions

[Two slides titled "Actions"; slide content not preserved. Source note on the slide: "Slide based on a talk found at" (link not preserved)]

PySpark wordcount example

    >>> from operator import add
    >>> text = sc.textFile("/gabriel/simple-input.txt")
    >>> words = text.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
    >>> counts = words.reduceByKey(add)
    >>> counts.saveAsTextFile("/gabriel/wordcount.out")

reduceByKey(func, numPartitions=None, partitionFunc=<f>)

- Merges the values for each key using an associative and commutative reduce function.
- Also performs the merging locally on each mapper before sending results to a reducer, similarly to a combiner in MapReduce.
- Output will be partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified.

    gabriel@whale:~> hdfs dfs -ls /gabriel/wordcount.out
    Found 5 items
    -rw-r--r-- 3 gabriel hadoop 0 /gabriel/wordcount.out/_SUCCESS
    -rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part
    -rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part
    -rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part
    -rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part
    gabriel@whale:~> hdfs dfs -cat /gabriel/wordcount.out/part
    (u'value3', 1)
    (u'line3', 1)
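The two-stage merging described for reduceByKey (local combine on each mapper, then a merge per output partition) can be sketched in plain Python. This is a toy model, not Spark's implementation: the functions combine_locally and reduce_by_key are hypothetical, and assigning keys to output partitions by hash(key) modulo the partition count is an assumption made for illustration.

```python
from operator import add


def combine_locally(partition, func):
    """Merge values per key inside one input partition (the combiner-like step)."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def reduce_by_key(partitions, func, num_partitions):
    """Toy model: combine locally, then merge per output partition."""
    outputs = [{} for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in combine_locally(partition, func).items():
            # Assumed hash partitioning; which partition a key lands in may vary per run.
            target = outputs[hash(key) % num_partitions]
            target[key] = func(target[key], value) if key in target else value
    return outputs


input_partitions = [
    [("line1", 1), ("value1", 1), ("line1", 1)],
    [("line1", 1), ("value2", 1)],
]
result = reduce_by_key(input_partitions, add, num_partitions=2)
counts = {key: value for part in result for key, value in part.items()}
```

Because the reduce function is associative and commutative, the per-partition partial sums can be merged in any order and still give the same totals, and each key ends up in exactly one output partition (one output file per partition in the listing above).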

With numPartitions=1, all results are written to a single output file:

    >>> from operator import add
    >>> text = sc.textFile("/gabriel/simple-input.txt")
    >>> words = text.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
    >>> counts = words.reduceByKey(add, numPartitions=1)
    >>> counts.saveAsTextFile("/gabriel/wordcount2.out")

    gabriel@whale:~> hdfs dfs -ls /gabriel/wordcount2.out
    Found 2 items
    -rw-r--r-- 3 gabriel hadoop 0 /gabriel/wordcount2.out/_SUCCESS
    -rw-r--r-- 3 gabriel hadoop 116 /gabriel/wordcount2.out/part

Persistence

- Spark can persist (or cache) a dataset in memory across operations
- Each node stores in memory any partitions of the dataset that it computes and reuses them in other actions on that dataset, often making future actions more than 10x faster
- The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it

    >>> text = sc.textFile("/gabriel/simple-input.txt")
    >>> lines = text.flatMap(lambda l: l.split()).cache()
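The caching and fault-tolerance behaviour described above can be modelled in a few lines of plain Python. The class CachedDataset and its methods are hypothetical and only illustrate the idea: a partition is computed once, served from memory afterwards, and rebuilt from its transformation if the cached copy is lost.

```python
class CachedDataset:
    """Toy model of persistence: compute each partition once, rebuild it if lost."""

    def __init__(self, partitions, transform):
        self._partitions = partitions   # base data, one list per partition
        self._transform = transform     # lineage: how to derive a partition
        self._cache = {}                # partition index -> computed result
        self.computations = 0           # number of partition computations so far

    def _partition(self, index):
        if index not in self._cache:    # cache miss: recompute from the lineage
            self.computations += 1
            self._cache[index] = [self._transform(x) for x in self._partitions[index]]
        return self._cache[index]

    def collect(self):
        return [x for i in range(len(self._partitions)) for x in self._partition(i)]

    def lose_partition(self, index):
        """Simulate losing a cached partition, e.g. through a node failure."""
        self._cache.pop(index, None)


words = CachedDataset([["line1", "value1"], ["line2", "value2"]], str.upper)
first = words.collect()               # computes both partitions
second = words.collect()              # served entirely from the cache
after_two_actions = words.computations
words.lose_partition(1)
third = words.collect()               # only partition 1 is recomputed
```

The second action triggers no computation at all, and after the simulated loss only the missing partition is recomputed rather than the whole dataset.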

Broadcast variables

- Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
  - For example, to give every node a copy of the required accuracy, the number of iterations, etc.
- Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost

    >>> b = sc.broadcast([1, 2, 3])
    >>> b.value
    [1, 2, 3]

Literature

- A large number of books on Spark and related tools are now available.
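Returning to broadcast variables: the saving from caching a read-only value once per machine, instead of shipping it with every task, can be estimated with a toy cost model. Everything in this sketch is an assumption made for illustration: the function bytes_shipped is hypothetical, pickle stands in for whatever serialization is actually used, and the task and machine counts are made up.

```python
import pickle


def bytes_shipped(value, num_tasks, num_machines, broadcast):
    """Estimate serialized bytes sent for a read-only value (toy cost model)."""
    size = len(pickle.dumps(value))
    # Broadcast: one copy per machine. Otherwise: one copy travels with each task.
    copies = num_machines if broadcast else num_tasks
    return size * copies


lookup = list(range(1000))   # hypothetical read-only table
with_tasks = bytes_shipped(lookup, num_tasks=64, num_machines=4, broadcast=False)
with_broadcast = bytes_shipped(lookup, num_tasks=64, num_machines=4, broadcast=True)
```

Under these assumptions the broadcast version moves one sixteenth of the data, since 64 tasks share 4 machines; the real saving depends on the cluster and on the broadcast algorithm in use.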
