Big Data Analytics with Apache Spark. Machiel Jansen. Matthijs Moed


1 Big Data Analytics with Apache Spark Machiel Jansen Matthijs Moed

2 Schedule
10:15-10:45 General introduction to Spark
10:45-11:15 Hands-on: Python notebook
11:15-11:30 General introduction to Spark (continued)
12:30-13:15 Spark DataFrames
13:15-14:00 Hands-on: Spark DataFrames notebook
14:00-15:00 Hadoop Map/Reduce
15:00-15:30 Spark RDD
15:30-16:15 Hands-on: Spark RDD notebook
16:15-17:00 Machine learning

3 What is Apache Spark? Spark is a software development framework, not an environment in which you can (easily) run binary (Unix) programs. It provides a simplified (and therefore limited, but easier) way of writing distributed data-intensive applications. Spark runs on commodity hardware or in cloud environments.

4 Scaling The need to react to increased load: keep the application/service running with similar performance under heavy load (many users, lots of data).

5 Scaling traditional applications Whoever runs the application now needs to: distribute and split data; handle the faults and errors inherent to scale; submit and track applications; use files or relational databases (fixed schemas).

6 An Example Suppose that from a tweet we are interested in finding: names of persons, names of organisations, locations and places. "I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting!"

7 Anatomy of a Tweet

8 A Straightforward Implementation Store tweets on disk. A small Python program uses NLTK and the Stanford NER to tag them. Write the output back to disk.

9 But

10 Scaling Bottlenecks Store tweets on disk: storage will eventually fill up. Small Python program: it can handle a tweet every few milliseconds to seconds, so we need to run separate processes. Write output back to disk: it will also eventually fill up. Run separate processes: they all need input.

11 Scaling up or out

12 State Remember things in between service requests. Example: the internet shopping basket. If different machines service the requests, they should share the state of the shopping basket.

13 Shared mutable state Example: multiple processes can change a file, a database, or a variable. When scaling out it becomes difficult to guard consistency and prevent unwanted results. Locking mechanisms must be put in place, which makes reasoning about the software complicated.

14 Try to avoid shared state "Working with distributed systems is fundamentally different from writing software on a single computer, and the main difference is that there are lots of new and exciting ways for things to go wrong." (Martin Kleppmann, Designing Data-Intensive Applications)

15 Parallel programming is hard

16 Latencies inside a machine Latencies (courtesy Ben Stopford)

17 Latency numbers (table of memory, disk and network latencies; image omitted)

18 Humanized latency numbers (the same latencies multiplied by 1 billion to bring them to a human scale; image omitted)

19 Spark - a general framework Spark aims to generalise MapReduce to support new applications with a more efficient engine that is simpler for end users. Write programs in terms of distributed datasets and operations on them. Accessible from multiple programming languages: Scala, Java, Python, and R (DataFrames only).

20

21 Jupyter

22 Hands-on: Notebooks Jupyter notebooks: 01-python.ipynb

23 Spark components

24

25 Two main APIs Low-level API: RDDs - Python, Scala, Java - no R support. Structured API: DataFrames - higher level - R support, SQL and ML.

26 Spark modes The SparkContext/SparkSession object contains information about the cluster and is the link between your code and the cluster. Local mode: a single machine, using multiple cores; for testing and training purposes. Cluster mode: a dedicated Spark cluster (also on clouds). Hadoop/cluster mode: use Hadoop YARN to deploy the cluster.
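A minimal sketch of creating a session in local mode from PySpark (the application name and core count are arbitrary choices for illustration):

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession running locally on 4 cores.
    spark = (SparkSession.builder
             .master("local[4]")
             .appName("spark-workshop")
             .getOrCreate())

    # The lower-level SparkContext is available through the session.
    sc = spark.sparkContext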

27 An executing application

28 Spark UI The Spark UI provides a visual way to understand running applications and metrics about your Spark workload. Available on port 4040

29 Spark UI

30 Languages and CLI Scala, Java, Python, R (DataFrames only). Command line: spark-submit.

31 Big Data = Big Data transfers Large, secure data transfer between institutes SURFnet network 10 Gbit/s Research Data Zone

32 Break

33 DataFrames A DataFrame is a distributed collection of data organized into named columns. Conceptually equivalent to a table in a relational database or a dataframe in R/Python. DataFrames can be constructed from a wide array of sources such as: structured data files, external databases, or existing RDDs.

34 DataFrames A collection of Row objects with a schema. DataFrames are immutable. Transformations and actions. Lazy, but the schema is checked eagerly. Distributed over the machines in the cluster.

35 Transformations and actions Transformation: transforms one or more DataFrames into another DataFrame (lazily computed). Action: sends data to the driver (your programming environment) or writes data to disk or a database (triggers computation).
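A small self-contained sketch of the distinction (the DataFrame and its columns are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(
        [("acme", 650), ("nokia", 120), ("apple", 900)],
        ["brand", "price"])

    # Transformations: only recorded in the plan, nothing runs yet.
    expensive = df.filter(df["price"] > 500).select("brand", "price")

    # Action: triggers the computation and returns a value to the driver.
    print(expensive.count())  # 2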

36 Partitions A partition is a collection of rows that sit on one physical machine in our cluster. A DataFrame's partitions represent how the data is physically distributed across your cluster of machines during execution. With DataFrames we do not (for the most part) manipulate partitions individually; we simply specify high-level transformations of data in the physical partitions, and Spark determines how this work will actually execute on the cluster. (taken from Spark: The Definitive Guide, Chambers & Zaharia, O'Reilly 2018)

37 DataFrame API Relational in flavour: select, groupBy, orderBy, where/filter, join, limit, etc. Possibility to define User-Defined Functions (UDFs). Optional: use SQL directly and work with tables/views.
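As a sketch of a UDF (the DataFrame and the wrapped function are invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("Dog",), ("Woodpecker",)], ["word"])

    # Wrap a plain Python function as a UDF, declaring its return type.
    word_length = udf(lambda s: len(s), IntegerType())

    df.select("word", word_length("word").alias("length")).show()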

38

39 Transformations A DataFrame phone_df passes through a chain: select -> DataFrame -> filter -> DataFrame -> groupby -> GroupedData -> max -> DataFrame (diagram)

40 Actions The same chain with an action at the end: select -> filter -> groupby -> max -> show; show is an action that returns the result to Python (diagram)

41 An Executing Application

42 An Executing Application

43 Spark: What Runs Where? At first glance, Spark code and RDD variables look local. It is important to keep track of which variables are local and which are references to distributed data (variables of type RDD).

44 Get data to screen A transformation returns a DataFrame; actions transfer the data to the driver.

45 Show and take show() prints the first 20 rows by default; take(n) returns the first n rows to the driver.

46 toPandas() Converts a Spark DataFrame to a pandas data frame on the driver, so only use it for results that fit in driver memory.
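A one-line sketch, assuming a SparkSession spark and pandas installed on the driver; the limit(1000) guards against pulling a huge dataset onto a single machine:

    # Collect a (small!) Spark DataFrame to the driver as a pandas DataFrame.
    pdf = spark.range(10000).limit(1000).toPandas()
    print(pdf.head())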

47 Optimizations Spark will optimize computation by building plans, making use of schema information.

48 Logical plan

49 Physical plan
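You can inspect both plans yourself with explain(); a minimal sketch with a made-up DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # extended=True prints the parsed, analyzed and optimized logical plans
    # followed by the physical plan.
    df.filter(df["id"] > 1).explain(True)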

50 Data sources CSV, JSON, Parquet, ORC, JDBC/ODBC connections, plain text files. Numerous community-created data sources, including Cassandra, HBase, MongoDB, AWS Redshift, and many others.

51 Reading DataFrameReader (accessed through the SparkSession's read attribute). We specify several values: the format (1), the schema (2), the read mode (3), a series of options (4), and the path (5).
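A sketch of the reader pattern, assuming a SparkSession spark; the file path and option values are hypothetical:

    df = (spark.read
          .format("csv")                   # 1. the format
          .option("inferSchema", "true")   # 2. the schema (here inferred)
          .option("mode", "PERMISSIVE")    # 3. the read mode
          .option("header", "true")        # 4. further options
          .load("../data/flights.csv"))    # 5. the path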

52 Read modes What to do with malformed records? permissive (the default): set corrupt fields to null and keep the record; dropMalformed: drop malformed records; failFast: fail immediately on malformed input.

53 Parquet Apache Parquet is an open-source column-oriented file format. Columnar compression; allows reading individual columns instead of entire files. Recommended for writing to long-term storage. More efficient than JSON or CSV. Another advantage of Parquet is that it supports complex types: if your column is an array (which would fail with a CSV file, for example), map, or struct, you'll still be able to read and write that file without issue.
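A minimal Parquet round trip (the path and data are made up); note that the array column would not survive a CSV round trip:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "tags"])

    # Columnar, compressed, schema-preserving storage.
    df.write.mode("overwrite").parquet("/tmp/example.parquet")
    spark.read.parquet("/tmp/example.parquet").show()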

54 Row- & column-oriented (from "Spark + Parquet In Depth", Emily Curtin and Robbie Strickland, Spark Summit East)

55 Parquet metadata

56 Writing DataFrameWriter - accessed via a DataFrame's write method. We specify: the format (1), the save mode (2), a series of options (3), and finally the path (4).
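The writer mirrors the reader; a sketch assuming a DataFrame df (format, mode, option and path are illustrative choices):

    (df.write
       .format("json")                  # 1. the format
       .mode("overwrite")               # 2. the save mode
       .option("compression", "gzip")   # 3. options
       .save("/tmp/example-json"))      # 4. the path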

57 DataFrames: Pros and cons Pro: a good level of abstraction for tabular, relational data; the schema allows Spark to optimize queries. Con: less suited for unstructured data (text); User-Defined Functions can be clunky and unwieldy.

58 Spark SQL Tables are logically equivalent to a DataFrame. The core difference is that DataFrames are defined within the scope of a programming language, while tables are defined inside a database. This means that when you create a table (assuming you never changed the database), it will belong to the default database.

59 SparkSQL Use the SparkSession's sql method. Create an SQL view, which is tied to the SparkSession.
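A minimal sketch (the data is invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("Dog", 3), ("Rabbit", 6)], ["name", "length"])

    # Register the DataFrame as a temporary view tied to this SparkSession...
    df.createOrReplaceTempView("animals")

    # ...and query it with SQL; the result is again a DataFrame.
    spark.sql("SELECT name FROM animals WHERE length > 3").show()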

60 SparkSQL

61 Save DataFrame as a Table

62 How tables are stored

63 Grouped data

64 Statistics pyspark.sql.functions contains many functions: statistics, date, math, list manipulation
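A small sketch of applying such functions to grouped data (the data is invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 2.0)], ["key", "value"])

    # Statistics from pyspark.sql.functions applied per group.
    df.groupBy("key").agg(
        F.mean("value").alias("mean"),
        F.max("value").alias("max")).show()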

65 Hands-on: Notebooks Jupyter notebooks: 02-spark-dataframes.ipynb

66

67

68 What a programmer wants To work with datasets that don't fit on a single machine. Have the operations on them automatically parallelized. Be able to use the full expressiveness of your language. Have this work like the native collections API. Not have to worry about machine failure.

69 Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review, Vol. 37, No. 5, 2003. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, 2004.

70 The First Open Source Big Data Project Hadoop was started by Doug Cutting and Mike Cafarella. Based on Google's concepts, written in Java. Goal: to power the Nutch web crawler. Became a separate Apache project in 2006. Popular term for open-source big data projects: the Hadoop ecosystem.

71 GFS/HDFS - design overview Handles failure of individual nodes. Optimised for large (100+ MB) files. Optimised for sequential reads. Favours high throughput over low latency.

72 HDFS - architecture Files are split into 128 MB blocks. Blocks are stored on many machines (datanodes). Each block is stored three times (on three different nodes). A single namenode handles metadata (namespace, block locations). Clients connect directly to datanodes. No updates of files.

73 Writing to HDFS

74 Data locality In Hadoop the same machines are often used for both storage and compute. The scheduler takes data location into account: it tries to schedule tasks on the same machine as the data.

75

76 Functional programming Restrict the programming interface so that the system can do more automatically. Use ideas from functional programming: "Here is a function, apply it to all of the data." "I do not care where it runs (the system should handle that)." "Feel free to run it twice on different nodes (no side effects!)"

77 Functions instead of iterations

78

79

80

81 Hadoop MapReduce

82

83 MapReduce strengths The MapReduce framework handles a lot of work for its end user: splitting work into independent tasks; task scheduling, retrying on failure; data grouping/shuffling, in-memory or spilling to disk.

84 MapReduce limitations Very low level: decomposing problems into (multiple) MapReduce jobs is hard. Batch-oriented: unsuited for interactive use or real-time processing. Disk sync: performance issues when chaining jobs (iterative algorithms).

85 Higher Level Frameworks SQL on Hadoop; a dataflow API in Java; Pig - a dataflow DSL; graph processing. All are translated to MapReduce jobs.

86

87

88 Creation of an RDD By transforming an existing RDD, or through the SparkContext. From an internal data structure:
text = "This is a sample text"
text_rdd = sc.parallelize(text.split())
From reading in a file (HDFS or otherwise):
lines = sc.textFile('../data/links.tsv')

89 Example pyspark SparkContext RDD creation
wordslist = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']
wordsrdd = sc.parallelize(wordslist)
filteredrdd = wordsrdd.filter(lambda x: len(x) > 3)
Laziness: Spark does nothing!

90 Partitions Like DataFrames, RDDs are physically divided into partitions. Each partition contains a number of records, determined by the data source. Partitions can be processed in parallel; processing within a partition is done sequentially.

91 Operations on RDDs Actions: return some value or side effect; trigger computation. Examples: count, saveAsTextFile. Transformations: create a new RDD; lazily computed. Examples: map, filter.

92 Transformations RDDs are created from other RDDs using transformations: map(f) => pass every element through function f; reduceByKey(f) => aggregate values with the same key using f.

93 Transformations (all lazy) RDDs are created from other RDDs using transformations: map(f) - apply function f to each element of the RDD; flatMap(f) - apply function f to each element of the RDD and unpack the resulting lists; filter(pred) - apply predicate pred to each element of the RDD and return those that pass; distinct() - remove duplicate entries from the RDD.

94 Example pyspark
wordslist = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']
wordsrdd = sc.parallelize(wordslist)
filteredrdd = wordsrdd.filter(lambda x: len(x) > 3)
result = filteredrdd.count()  # 6
Action: Spark acts!

95 Map Pattern: given a Python function f(x) that works on a single element x, for example tokenize("I like traffic lights") returning ['I', 'like', 'traffic', 'lights'], apply f to all elements in the RDD: myrdd.map(tokenize)

96 map vs flatMap
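A small sketch of the difference, assuming a SparkContext sc as in the notebooks:

    lines = sc.parallelize(["the cat sat", "on the mat"])

    # map: one output element per input element -> an RDD of lists.
    lines.map(lambda s: s.split()).collect()
    # [['the', 'cat', 'sat'], ['on', 'the', 'mat']]

    # flatMap: the lists are unpacked into one flat RDD of words.
    lines.flatMap(lambda s: s.split()).collect()
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']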

97

98

99 Example reduce Takes a function with two arguments.
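A minimal sketch, assuming a SparkContext sc:

    numbers = sc.parallelize([1, 2, 3, 4])

    # reduce is an action: it repeatedly combines pairs of elements
    # with a two-argument function until one value remains.
    total = numbers.reduce(lambda x, y: x + y)  # 10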

100 Pair RDDs The elements of a pair RDD are pairs (tuples) (x, y): x is interpreted as the key, y as the value. Very much like MapReduce. Pair RDDs have extra methods.

101 Pair RDD transformations groupByKey() - returns an RDD with elements (key, value-list); reduceByKey(f(x,y)) - applies f to all values of each key (similar to Hadoop MapReduce); join(rdd) - joins two RDDs on their keys; mapValues(f) - applies f to the values, not the keys, of the RDD.
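A sketch of these transformations, assuming a SparkContext sc (the data is invented):

    pairs = sc.parallelize([("cat", 1), ("dog", 2), ("cat", 3)])

    # reduceByKey: combine the values per key.
    pairs.reduceByKey(lambda x, y: x + y).collect()  # e.g. [('cat', 4), ('dog', 2)]

    # mapValues: transform only the values; keys stay untouched.
    pairs.mapValues(lambda v: v * 10).collect()

    # join: combine two pair RDDs on their keys.
    sounds = sc.parallelize([("cat", "meow"), ("dog", "woof")])
    pairs.join(sounds).collect()
    # e.g. [('cat', (1, 'meow')), ('cat', (3, 'meow')), ('dog', (2, 'woof'))]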

102 Creating a pair RDD Transform an RDD into a pair RDD using map, as in: pairrdd = textrdd.map(lambda x: (x, 1))

103 Word Count Input: "the cat sat on the mat / the aardvark sat on the sofa". Output (first lines): the 4, cat 1, sat 2, on 2, ...

104 Word Count Input: "the cat sat on the mat / the aardvark sat on the sofa". Output: aardvark 1, cat 1, mat 1, on 2, sat 2, sofa 1, the 4

105
lines = sc.textFile(file)
words = lines.flatMap(lambda s: s.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)

106

107

108

109

110 Pseudo set operations

111 Efficient & fault-tolerant Intermediate RDDs are computed only when needed -> pipelining. RDD partitions that are never needed are not computed -> laziness. Lineage information is stored for every RDD partition -> reconstruction.
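You can inspect the stored lineage yourself; a minimal sketch assuming a SparkContext sc:

    rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 10)

    # toDebugString shows the chain of parent RDDs that Spark would use
    # to recompute lost partitions.
    print(rdd.toDebugString().decode())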

112 DIY

113 What is Docker?

114 Hands-on: Notebooks Jupyter notebooks: 03-spark-rdds.ipynb

115 Machine Learning: MLlib Why another machine learning library?

116 Spark MLlib Scale: many data sets/models become too big for a single machine. Spark is good at training models in a distributed fashion. Not so good at predicting with very low latency (due to the overhead of starting Spark jobs).

117 Machine learning 1. Data exploration 2. Data preprocessing 3. Model training 4. Model evaluation 5. Model inspection

118 MLlib: Spark's Machine Learning library The Apache Spark core distribution has included a machine learning library, MLlib, since its inception. MLlib was based on the RDD API. Spark 1.2 introduced a new package called spark.ml, a high-level interface based on DataFrames. Since Spark 2.0 both are called MLlib. The DataFrames API is the primary API; the RDD API is in maintenance mode, expected to be deprecated in 2.3 and removed in 3.0.

119 MLlib data types MLlib (RDD API) uses numerical data types backed by Breeze: Local vector - dense and sparse vectors of doubles; Labeled point - a local vector plus a label, used by supervised learning algorithms; Local matrix - dense and sparse matrices stored on a single machine; Distributed matrix - row and column indices with double values, stored in one or more RDDs.

120 New in Spark 2.3: ImageSchema A representation for images based on OpenCV:
imageschema = StructType([
    StructField("mode", StringType(), False),
    StructField("origin", StringType(), True),
    StructField("height", IntegerType(), False),
    StructField("width", IntegerType(), False),
    StructField("nChannels", IntegerType(), False),
    StructField("data", BinaryType(), False)
])

121 MLlib Common machine learning algorithms on top of Spark: classification: SVM, Naive Bayes, random forests; regression: logistic regression, decision trees, isotonic regression; clustering: k-means, PIC, LDA; collaborative filtering: alternating least squares; dimensionality reduction: SVD, PCA.
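A minimal sketch of the DataFrame-based spark.ml API, with made-up training data (two features and a binary label):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    train = spark.createDataFrame(
        [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.0, 1.3, 1.0), (0.0, 1.2, 0.0)],
        ["f1", "f2", "label"])

    # Assemble the feature columns into one vector column, then fit a model.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(maxIter=10)
    model = Pipeline(stages=[assembler, lr]).fit(train)

    model.transform(train).select("label", "prediction").show()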

122 Alternatives to Spark MLlib These libraries can use Spark as a backend and have their own API: Sparkling Water (H2O), DL4J, Apache Mahout.

123 Hands-on: Notebooks Jupyter notebooks: 04-decision-trees.ipynb 05-random-forests.ipynb
