CS 378 Big Data Programming. Lecture 20 Spark Basics

2 Review Assignment 9
- Download and run Spark
- Controlling logging output
CS Fall 2018 Big Data Programming 2

3 Review - Spark RDDs
- Resilient Distributed Dataset
- One RDD has one or more partitions
  - Partitions are distributed across machines
- Rebuilt from base data on failure (versus replication)
- Lazy evaluation: created/computed on demand
- RDD types offer various functions
  - map, reduce
  - groupBy, reduceByKey
  - joins (inner, leftOuterJoin, rightOuterJoin)
  - filter, sample

4 Review - Spark Programs
A Spark program defines:
- RDDs
  - Input from external sources
  - Produced by a transformation
- Transformations
  - Produce a new RDD from the input RDD(s)
- Actions
  - Compute something from the input RDD
  - Return non-RDD objects (e.g., a number)
  - Write an RDD to external storage

5 Spark Actions
- Transformations define data flow (data lineage)
- Actions on RDDs initiate computation
- Counting: rdd.count()
- Extract some elements: rdd.take(10)
- Get all elements: rdd.collect()
  - Careful, as this assembles the entire RDD. Filter first.
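The split between transformations (which only describe the data flow) and actions (which trigger computation) can be illustrated without a cluster. The sketch below is an analogy in plain Java streams, not Spark code: the intermediate `map` stands in for a transformation and the terminal `collect` stands in for an action. The class name `LazyDemo` and its members are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyDemo {
    // Counts how many times the mapping function has actually run.
    static final AtomicInteger calls = new AtomicInteger();

    // Analogous to a transformation: describes the work, runs nothing yet.
    static Stream<Integer> doubled(List<Integer> input) {
        return input.stream().map(x -> {
            calls.incrementAndGet();
            return x * 2;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = doubled(List.of(1, 2, 3));
        System.out.println("calls before the terminal operation: " + calls.get());

        // Analogous to an action: the terminal operation forces evaluation.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result + ", calls after: " + calls.get());
    }
}
```

The analogy is loose in one respect: a Java stream can be consumed only once, whereas an RDD can be reused by several actions.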

6 Spark Transformations
- Transformations take functions as arguments
  - This function defines the details of the transform
- Filter
  - Apply a function to each element of an RDD, return those elements that evaluate to true
  - Source RDD element type: T
  - Result RDD element type: T
  - Java function (class) type: Function<T, Boolean>
  - Java method: Boolean call(T t)

7 Spark Transformations
- Map
  - Apply a function to each element of an RDD, return the result of applying the function
  - Source RDD element type: T
  - Result RDD element type: R
  - Java function (class) type: Function<T, R>
  - Java method: R call(T t)

8 Spark Transformations
- Flat Map
  - Apply a function to each element of an RDD, return the result of applying the function (an Iterator)
  - Source RDD element type: T
  - Result RDD element type: R
  - Java function (class) type: FlatMapFunction<T, R>
  - Java method: Iterator<R> call(T t)
  - Spark then iterates through each iterable returned, to create an RDD with element type R
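How filter, map, and flat map differ in their input and output element types can be seen in a small model. This is a sketch using `java.util.stream` rather than Spark itself, where the same roles are played by `Function<T, Boolean>`, `Function<T, R>`, and `FlatMapFunction<T, R>` objects. The class `TransformDemo` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TransformDemo {
    // filter: T -> T, keeps the elements for which the predicate is true.
    static List<String> nonEmpty(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    // map: T -> R, exactly one output element per input element.
    static List<Integer> lengths(List<String> lines) {
        return lines.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    // flat map: each input yields a sequence of R; the sequences are flattened.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or", "", "not to be");
        System.out.println(nonEmpty(lines));
        System.out.println(lengths(lines));
        System.out.println(words(nonEmpty(lines)));
    }
}
```

Note that map preserves the number of elements, while flat map can change it: two lines become six words here.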

9 Spark Transformations
- Map to Pair
  - Apply a function to each element of an RDD, return the result of applying the function (a key and a value)
  - Source RDD element type: T
  - Result RDD element type: <K, V>
  - Java function (class) type: PairFunction<T, K, V>
  - Java method: Tuple2<K, V> call(T t)

10 Spark Transformations
- Reduce by Key
  - Apply a function to pairs of values with the same key, return the result (key and a reduced value)
  - Source RDD element type: <K, V>
  - Result RDD (JavaPairRDD) element type: <K, V>
  - Java function (class) type: Function2<V, V, V>
  - Java method: V call(V v1, V v2)
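The combination of map to pair and reduce by key can be modeled sequentially in plain Java. The sketch below is not Spark code: `Map.Entry` stands in for `Tuple2<K, V>`, a `BinaryOperator<V>` stands in for `Function2<V, V, V>`, and the class `ReduceByKeyDemo` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

class ReduceByKeyDemo {
    // Models "map to pair": each word becomes a (word, 1) pair.
    static List<Map.Entry<String, Integer>> toPairs(List<String> words) {
        return words.stream()
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Models "reduce by key": values sharing a key are combined pairwise with op.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> op) {
        Map<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // First value for a key is stored as-is; later values are merged with op.
            result.merge(pair.getKey(), pair.getValue(), op);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a", "c", "b", "a");
        System.out.println(reduceByKey(toPairs(words), Integer::sum));
    }
}
```

This model combines values in a single pass on one machine. Spark combines values within and across partitions in no fixed order, which is why the reduce function should be associative and commutative.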

11 Spark Code Example
- WordCount again
- Alternative implementations of WordCount

12 Assignment 10
- Inverted Index in Spark
- Data set example
  - Genesis:46:14 And the sons of Zebulun; Sered, and Elon, and Jahleel.
- Output for this document (one verse)
  - How many outputs for this verse?
  - What should the keys be?
  - What should the value(s) be?

13 Assignment 10
- Input record parsing
  - Document-ID: book:chapter:verse
  - Text string that is the verse
- Verse text parsing:
  - Remove punctuation, lowercase all words
- Output:
  - Word, list of book:chapter:verse (no duplicates)
  - Sort the book:chapter:verse references
- Extra credit: output a file ordered by the number of verses referenced for a word (descending)
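The parsing steps listed above can be sketched in plain Java, independent of Spark. Several points are assumptions rather than part of the assignment: the document ID is taken to be everything before the first space, references are sorted as plain strings (so chapter 46 sorts before chapter 5), and the class `InvertedIndexSketch`, its methods, and the `Demo:1:1` record are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Splits "book:chapter:verse text..." at the first space into {docId, text}.
    // Assumes the document ID itself contains no spaces.
    static String[] splitRecord(String record) {
        int space = record.indexOf(' ');
        return new String[] { record.substring(0, space), record.substring(space + 1) };
    }

    // Lowercases the text, removes ASCII punctuation, returns each distinct word once.
    static Set<String> distinctWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : text.toLowerCase().replaceAll("\\p{Punct}", "").split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Builds word -> sorted, duplicate-free set of document IDs.
    static Map<String, TreeSet<String>> buildIndex(List<String> records) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (String record : records) {
            String[] parts = splitRecord(record);
            for (String word : distinctWords(parts[1])) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(parts[0]);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "Genesis:46:14 And the sons of Zebulun; Sered, and Elon, and Jahleel.",
                "Demo:1:1 The sons and the daughters."); // made-up record for illustration
        buildIndex(records).forEach((word, refs) -> System.out.println(word + " " + refs));
    }
}
```

In a Spark version, `distinctWords` would naturally sit inside a flat map or map to pair step and the per-word sets would be built by a by-key operation; that mapping is left open here.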
