A very short introduction

Ruth Knight

1 A very short introduction

2 General-purpose compute engine for clusters
  batch / interactive / streaming
  used by [company logos not preserved] and many others

3 History
  developed at UC Berkeley
  joined the Apache Foundation in 2013
  implemented in Scala
  most recent release: (Oct. 2015)
  APIs: Scala, Java, Python, R
  growing set of additional components: GraphX, MLlib, Spark Streaming, Spark SQL

4 Tools
  Interactive Scala shell: spark-shell
  Submit Spark jobs: spark-submit
  Interactive Python shell: pyspark
    IPYTHON=1 pyspark
    IPYTHON_OPTS="notebook" pyspark
  Interactive R shell: sparkR

5 Tools
  Spark runs a web console on [address not preserved] (by default) for monitoring the runtime

6 RDDs Resilient Distributed Datasets

7 RDD: Resilient Distributed Dataset
  the fundamental abstraction in Spark
  a read-only collection of objects partitioned across a set of machines
  can be rebuilt if a partition is lost

8 RDD: Resilient Distributed Dataset
  [Figure: an RDD divided into partitions, each held in the memory of a different machine / node]

9 RDD: Resilient Distributed Dataset
  Created by parallelizing a collection (sc is a pyspark.context.SparkContext):

    words = ['Foo', 'bar', '_baz', 'BBQ!']
    words_rdd = sc.parallelize(words)

  words_rdd is a ParallelCollectionRDD[8] at parallelize at PythonRDD.scala:423

10 RDD: Resilient Distributed Dataset
  Created from a file / HDFS storage:

    words_rdd = sc.textFile('./words.txt')

  words_rdd is a MapPartitionsRDD[19] at textFile at NativeMethodAccessorImpl.java:-2

11 RDD: Resilient Distributed Dataset
  Created by transforming an existing RDD:

    words_with_bs_rdd = words_rdd.filter(lambda w: 'b' in w.lower())

  words_with_bs_rdd is a PythonRDD[22] at RDD at PythonRDD.scala:43

12 RDD: Resilient Distributed Dataset
  RDDs are evaluated lazily: transformations only describe the computation, and an action triggers it.

    words = ['Foo', 'bar', '_baz', 'BBQ!']
    words_rdd = sc.parallelize(words)
    words_with_bs_rdd = words_rdd.filter(lambda w: 'b' in w.lower())

    # the action triggers the actual computation
    words_with_bs_rdd.collect()
    # ['bar', '_baz', 'BBQ!']
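The define-now, compute-later behaviour can be mimicked without a cluster using Python 3's built-in lazy `filter`. This is only a local analogy, not Spark code; the `evaluated` log and the `has_b` helper are hypothetical names introduced here to make the moment of computation visible.

```python
# Plain-Python analogy of lazy evaluation (assumes Python 3; not Spark code).
words = ['Foo', 'bar', '_baz', 'BBQ!']
evaluated = []  # hypothetical log of the elements that were actually inspected

def has_b(w):
    evaluated.append(w)            # side effect so we can see when work happens
    return 'b' in w.lower()

# Like a transformation: builds a lazy iterator and inspects nothing yet.
words_with_bs = filter(has_b, words)
seen_before = list(evaluated)      # snapshot taken before any "action"

# Like an action: consuming the iterator triggers the computation.
result = list(words_with_bs)
seen_after = list(evaluated)       # snapshot taken after the "action"
```

`seen_before` stays empty while `seen_after` holds every word, which is roughly what happens when `collect()` is called on a filtered RDD.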

13 Transformations

14 Transformations: map
  Input: ['Foo', 'bar', '_baz', 'BBQ!']

    import re

    def stripnonalpha(s):
        # drop every run of non-letter characters
        return re.sub('[^a-zA-Z]+', '', s)

    only_words_rdd = words_rdd.map(stripnonalpha)

    # get the RDD contents as a list
    only_words_rdd.collect()
    # ['Foo', 'bar', 'baz', 'BBQ']

15 Transformations: map
  Input: ['Foo', 'bar', '_baz', 'BBQ!']

    import re

    def stripnonalpha(s):
        # drop every run of non-letter characters
        return re.sub('[^a-zA-Z]+', '', s)

    # map calls can be chained
    lower_words_rdd = (words_rdd.map(stripnonalpha)
                                .map(lambda w: w.lower()))

    lower_words_rdd.collect()
    # ['foo', 'bar', 'baz', 'bbq']

16 Transformations: map
  Input: ['foo', 'bar', 'baz', 'bbq']

    # map keeps one output element per input element, here a list per word
    letters_rdd = lower_words_rdd.map(lambda w: list(w))

    letters_rdd.collect()
    # [['f', 'o', 'o'], ['b', 'a', 'r'], ['b', 'a', 'z'], ['b', 'b', 'q']]

17 Transformations: flatMap
  Input: ['foo', 'bar', 'baz', 'bbq']

    # flatMap concatenates the per-element results into one flat RDD
    letters_rdd = lower_words_rdd.flatMap(lambda w: list(w))

    letters_rdd.collect()
    # ['f', 'o', 'o', 'b', 'a', 'r', 'b', 'a', 'z', 'b', 'b', 'q']
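The difference between the two transformations can be checked locally with ordinary lists. The sketch below is a plain-Python analogy rather than Spark code; `mapped` and `flat_mapped` are illustrative names.

```python
from itertools import chain

words = ['foo', 'bar', 'baz', 'bbq']

# map-like: exactly one output element per input element (a list per word)
mapped = [list(w) for w in words]

# flatMap-like: each input may yield several elements,
# and all of them are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(list(w) for w in words))
```

`mapped` keeps four elements (one nested list per word), whereas `flat_mapped` holds the twelve individual letters.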

18 Transformations: reduceByKey
  Input: ['f', 'o', 'o', 'b', 'a', 'r', 'b', 'a', 'z', 'b', 'b', 'q']

    # turn each letter into a (letter, 1) pair, then sum the ones per letter
    letter_counts_rdd = (letters_rdd.map(lambda l: (l, 1))
                                    .reduceByKey(lambda i, j: i + j))

    letter_counts_rdd.collect()
    # [('a', 2), ('q', 1), ('o', 2), ('r', 1), ('b', 4), ('z', 1), ('f', 1)]
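To see why the reducing function should be associative and commutative, it can help to emulate the per-key reduction over explicit partitions. The following is a toy, single-machine sketch: `reduce_by_key` is a hypothetical helper, the two-way split is an assumption, and none of it reflects how Spark is implemented internally.

```python
from functools import reduce

def reduce_by_key(partitions, func):
    """Toy stand-in for a per-key reduction over a list of partitions."""
    def combine(acc, pair):
        key, value = pair
        acc[key] = func(acc[key], value) if key in acc else value
        return acc

    # Step 1: reduce inside each partition independently (could run in parallel).
    partials = [reduce(combine, part, {}) for part in partitions]

    # Step 2: merge the per-partition results with the same function.
    merged = {}
    for partial in partials:
        reduce(combine, partial.items(), merged)
    return merged

letters = ['f', 'o', 'o', 'b', 'a', 'r', 'b', 'a', 'z', 'b', 'b', 'q']
pairs = [(l, 1) for l in letters]
partitions = [pairs[:6], pairs[6:]]   # assumed split into two partitions

letter_counts = reduce_by_key(partitions, lambda i, j: i + j)
```

Because addition does not care about grouping or order, the counts are the same however the pairs are split across partitions.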

19 Transformations: sortBy
  Input: [('a', 2), ('q', 1), ('o', 2), ('r', 1), ('b', 4), ('z', 1), ('f', 1)]

    # sort by the count (second tuple element), largest first
    letter_counts_rdd.sortBy(lambda l: l[1], ascending=False).collect()
    # [('b', 4), ('a', 2), ('o', 2), ('q', 1), ('r', 1), ('z', 1), ('f', 1)]

20 Transformations: groupByKey

    def parse_tx(tx):
        # 'name, item1, item2, ...' -> [(item1, name), (item2, name), ...]
        ts = [t.strip() for t in tx.split(',')]
        name = ts[0]
        items = ts[1:]
        return [(item, name) for item in items]

    txs = sc.parallelize(['Bart, beer, wine, chips, diapers',
                          'Kurt, chips, meat, foo',
                          'Gert, beer, bar, cheese'])

    # after groupByKey the key is the item and the values are the buyers' names
    itxs = (txs.flatMap(parse_tx)
               .groupByKey()
               .map(lambda kv: (kv[0], list(kv[1]))))

    # pretty please: if possible, keep this RDD in memory for further processing
    itxs.cache()
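What the grouped result contains can be previewed locally with a dictionary of lists. This is a plain-Python analogy of the flatMap and groupByKey steps, not Spark code; `buyers_by_item` is an illustrative name.

```python
from collections import defaultdict

def parse_tx(tx):
    # 'name, item1, item2, ...' -> [(item1, name), (item2, name), ...]
    ts = [t.strip() for t in tx.split(',')]
    name, items = ts[0], ts[1:]
    return [(item, name) for item in items]

txs = ['Bart, beer, wine, chips, diapers',
       'Kurt, chips, meat, foo',
       'Gert, beer, bar, cheese']

buyers_by_item = defaultdict(list)
for tx in txs:                                # flatMap-like: several pairs per transaction
    for item, name in parse_tx(tx):
        buyers_by_item[item].append(name)     # groupByKey-like: collect the values per key
```

Each item ends up mapped to the people who bought it, for example `buyers_by_item['beer']` holds Bart and Gert.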

21 Transformations: cartesian

    movies = sc.parallelize(['Batman', 'Iron man', 'Everest', 'Rambo'])
    users = sc.parallelize(['Bob', 'Olaf', 'Owen', 'Tom'])

    # every (movie, user) combination
    movies.cartesian(users).collect()
    # [('Batman', 'Bob'), ('Batman', 'Olaf'), ('Batman', 'Owen'), ('Batman', 'Tom'),
    #  ('Iron man', 'Bob'), ('Iron man', 'Olaf'), ('Iron man', 'Owen'), ('Iron man', 'Tom'),
    #  ('Everest', 'Bob'), ('Everest', 'Olaf'), ('Everest', 'Owen'), ('Everest', 'Tom'),
    #  ('Rambo', 'Bob'), ('Rambo', 'Olaf'), ('Rambo', 'Owen'), ('Rambo', 'Tom')]
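The size of a Cartesian product grows as the product of the two input sizes, which is worth keeping in mind before applying it to large RDDs. A local sketch with the standard library's `itertools.product` (an analogy, not Spark code) shows the same pairing:

```python
from itertools import product

movies = ['Batman', 'Iron man', 'Everest', 'Rambo']
users = ['Bob', 'Olaf', 'Owen', 'Tom']

# every (movie, user) combination, movie-major order
pairs = list(product(movies, users))

n_pairs = len(pairs)   # len(movies) * len(users)
```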

22 Other transformations
  selecting elements from RDDs: filter, distinct, sample, takeSample
  combining RDDs: join, cartesian, union, intersection, subtract, zip / zipWithIndex / zipWithUniqueId
  sorting: sortByKey
  map with an external tool: pipe
  & many others

23 Actions

24 Actions: count
  Input: ['foo', 'bar', 'baz', 'bbq']

    lower_words_rdd.count()
    # 4

25 Actions: collect
  Input: ['foo', 'bar', 'baz', 'bbq']

    lower_words_rdd.collect()
    # ['foo', 'bar', 'baz', 'bbq']

26 Other actions
  basic statistics: max / min, mean[Approx] / stdev, sum[Approx], count[Approx], countByKey / countByValue
  reduce, fold, aggregate
  take[Ordered]
  & many others
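Of these, aggregate is the least obvious: it takes a zero value, a function that folds elements into an accumulator within a partition, and a function that merges the per-partition accumulators. The sketch below emulates that shape on a single machine to compute a mean word length; the `aggregate` helper and the two-way partition split are assumptions made for illustration, not Spark's implementation.

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Toy stand-in for an aggregate action over a list of partitions."""
    # fold each partition's elements into its own accumulator
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # merge the per-partition accumulators into one result
    return reduce(comb_op, partials, zero)

words = ['foo', 'bar', 'baz', 'bbq']
partitions = [words[:2], words[2:]]   # assumed split into two partitions

total_len, n_words = aggregate(
    partitions,
    (0, 0),                                           # (sum of lengths, count)
    lambda acc, w: (acc[0] + len(w), acc[1] + 1),     # within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # across partitions
)
mean_len = total_len / n_words
```

Carrying a (sum, count) pair is what lets the accumulator type differ from the element type, which a plain reduce cannot do.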

27 Shared variables

28 Broadcast variables
  read-only

    colors = sc.broadcast({'red': '0xff0000',
                           'green': '0x00ff00',
                           'purple': '0x9933cc'})

    vegetables = sc.parallelize([('tomato', 'red'),
                                 ('cucumber', 'green'),
                                 ('banana', 'purple')])

    # look up each colour name in the broadcast dictionary
    vegetables.map(lambda kv: (kv[0], colors.value[kv[1]])).collect()
    # [('tomato', '0xff0000'), ('cucumber', '0x00ff00'), ('banana', '0x9933cc')]

29 Accumulators
  write-only for the tasks; the value can be read only in the main driver

    letters_wasted = sc.accumulator(0)

    def func(s):
        global letters_wasted
        letters_wasted += len(s)

    users = sc.parallelize(['bart', 'bert', 'bort'])

    # foreach applies func to each element of the RDD
    users.foreach(func)

    # reading .value works only in the main driver
    print(letters_wasted.value)
    # 12

30 Self study
  Intro to Apache Spark (Brain-Friendly Tutorial)
  Parallel Programming with Spark (Part 1 & 2)
  Spark: Cluster Computing with Working Sets
  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
