Apache Spark 2.0. Matei

1 Apache Spark 2.0 Matei

2 What is Apache Spark?
  Open source data processing engine for clusters
    Generalizes MapReduce model
  Rich set of APIs and libraries
    In Scala, Java, Python and R
    Streaming, SQL, ML, Graph
  Large community of contributors
  Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation

6 Relationship with Apache Mesos
  The two projects go back a long time!
  2008: Mesos started as UC Berkeley research project
  2009: Spark started as example framework for iterative apps
  About 3 hours drive from here

7 Apache Spark Vision
  1) Concise, high-level API
     Functional programming in Scala, Python, Java, etc.
  2) Unified engine for big data processing
     Combines batch, interactive, and streaming

8 Motivation: Concise API
  Much of data analysis is exploratory / interactive
  Spark solution: Resilient Distributed Datasets (RDDs)
    Distributed collection abstraction with simple functional API

    lines = spark.textFile("hdfs://...")              // RDD[String]
    points = lines.map(line => parsePoint(line))      // RDD[Point]
    points.filter(p => p.x > 100).count()

  Higher level APIs: DataFrames, ML pipelines
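
The map/filter/count shape of the RDD pipeline can be tried without a cluster. A minimal sketch in plain Python, assuming a hypothetical `parse_point` helper, a comma-separated input format, and a small in-memory list standing in for the HDFS file (none of these come from the slide). It mirrors the functional style only, not Spark's distribution or fault tolerance.

```python
from collections import namedtuple

# Hypothetical record type and parser; the slide does not define parsePoint.
Point = namedtuple("Point", ["x", "y"])


def parse_point(line):
    """Parse a line like '120.5,3.0' into a Point (assumed input format)."""
    x, y = line.split(",")
    return Point(float(x), float(y))


# Stand-in for spark.textFile("hdfs://..."): a local list of lines.
lines = ["120.5,3.0", "99.0,1.0", "250.0,7.5"]

# map -> filter -> count, the same shape as the RDD pipeline on the slide.
points = map(parse_point, lines)
big_count = sum(1 for p in points if p.x > 100)
print(big_count)
```

As in the slide, nothing is computed until the final counting step consumes the lazily mapped points.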

9 Motivation: Unification
  MapReduce: general batch processing
  Pregel, Dremel, Impala, Storm, Giraph, Drill, Presto, S4, ...: specialized systems for new workloads
  Hard to manage, tune, deploy
  Hard to combine in pipelines

10 Motivation: Unification
  MapReduce: general batch processing
  Pregel, Dremel, Impala, Storm, Giraph, Drill, Presto, S4, ...: specialized systems for new workloads
  Unified engine?

11 Built-In Libraries
  SQL + DataFrames, Streaming, MLlib, GraphX
  All built on Spark Core (RDDs)

12 Combining Libraries

    // Load data using SQL
    ctx.jsonFile("tweets.json").registerTempTable("tweets")
    val points = ctx.sql("select latitude, longitude from tweets")

    // Train a machine learning model
    val model = KMeans.train(points, 10)

    // Apply it to a stream
    ctx.twitterStream(...)
       .map(t => (model.predict(t.location), 1))
       .reduceByWindow("5s", (a, b) => a + b)

13 Apache Spark 2.0
  Next major release, coming in June
  In development since January of this year
  Remains highly compatible with Spark 1.X
    Small changes to reduce binary dependencies

14 Major Features in 2.0
  Project Tungsten: 5-10x speedups for structured APIs
  Structured Streaming: higher-level streaming engine
  Machine Learning Model Export

16 What are Structured APIs?
  New set of high-level APIs (DataFrames and Datasets) that act on structured data, i.e. records with a known schema
  Enable much more efficient implementation
    Original API: Java functions on Java objects (hard to analyze!)
    Structured APIs: declarative operators on structured records

17 Example: DataFrames
  DataFrames hold rows with a known schema and offer relational operations on them through a DSL

    val c = new HiveContext()
    val users = c.sql("select * from users")

    // The filter argument is an expression AST, not an opaque function
    val maUsers = users.filter(users("state") === "MA")

    maUsers.count()
    maUsers.groupBy("name").avg("age")

    maUsers.as[User]                       // Dataset[User]
           .map(u => u.name.toUpperCase)   // Dataset[String]
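
The "Expression AST" annotation is the key idea: a comparison such as `users("state") === "MA"` builds a tree the engine can inspect, rather than evaluating to a boolean. A minimal sketch of that mechanism in plain Python, using operator overloading; the `Col`, `Lit`, and `Eq` classes are hypothetical names for illustration and are not Spark's internal classes.

```python
class Expr:
    """Base class: comparing expressions builds a tree instead of a bool."""

    def __eq__(self, other):
        right = other if isinstance(other, Expr) else Lit(other)
        return Eq(self, right)


class Col(Expr):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Col({self.name!r})"

    def evaluate(self, row):
        return row[self.name]


class Lit(Expr):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"Lit({self.value!r})"

    def evaluate(self, row):
        return self.value


class Eq(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def __repr__(self):
        return f"Eq({self.left!r}, {self.right!r})"

    def evaluate(self, row):
        return self.left.evaluate(row) == self.right.evaluate(row)


# Building the predicate does not touch any data; it only records structure.
predicate = Col("state") == "MA"
print(predicate)  # Eq(Col('state'), Lit('MA'))

users = [
    {"name": "Ann", "state": "MA", "age": 31},
    {"name": "Bob", "state": "CA", "age": 45},
]
ma_users = [u for u in users if predicate.evaluate(u)]
```

Because the predicate is data, an engine could reorder it, push it into a data source, or compile it, which is not possible with an arbitrary function on objects.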

18 What Structured APIs Enable
  1. Compact binary representation
     Compressed columnar format; storage outside Java heap
  2. Optimization across operators (join ordering, pushdown, etc.)
  3. Runtime code generation
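
The first point, compact binary representation, can be illustrated with a small sketch. This uses Python's standard `array` module as a stand-in for a columnar buffer; the sample rows are made up, and Spark's actual Tungsten format is not shown here.

```python
from array import array

# Row-oriented layout: one dict object per record (sample data, assumed).
rows = [
    {"name": "Ann", "age": 31.0},
    {"name": "Bob", "age": 45.0},
    {"name": "Cy", "age": 27.0},
]

# Columnar layout: one packed array of C doubles for the numeric column.
age_column = array("d", (r["age"] for r in rows))

# Packed size: itemsize bytes per value, with no per-object headers.
packed_bytes = age_column.itemsize * len(age_column)

# An aggregate only needs to scan the one column it uses.
avg_age = sum(age_column) / len(age_column)
print(packed_bytes, avg_age)
```

Knowing the schema up front is what makes this layout possible: the engine can store each field as raw bytes instead of general-purpose objects.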

19 Space Usage

20 Performance
  [Bar chart: aggregation benchmark time (s) for DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python, RDD Scala]

21 New in 2.0
  Whole-stage code generation
    Fuse across multiple operators
    Spark 1.6: [missing]M rows/s; Spark 2.0: 125M rows/s
  Optimized input / output
    Apache Parquet + built-in cache
    Parquet in 1.6: [missing]M rows/s; Parquet in 2.0: 90M rows/s
  Automatically applies to SQL, DataFrames & Datasets
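
Fusing across operators can be sketched as follows. This is a toy illustration in plain Python, not Spark's code generator: `volcano_style` chains one generator per operator, while the hypothetical `generate_fused` helper builds source code for a single loop that does filter, projection, and sum together, then compiles it at runtime.

```python
def volcano_style(rows):
    """Operator-at-a-time: each operator is a separate generator."""
    filtered = (r for r in rows if r > 10)
    projected = (r * 2 for r in filtered)
    return sum(projected)


def generate_fused(predicate_src, project_src):
    """Generate and compile one loop covering filter, projection, and sum."""
    src = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for r in rows:\n"
        f"        if {predicate_src}:\n"
        f"            total += {project_src}\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)  # runtime code generation
    return namespace["fused"], src


fused, source = generate_fused("r > 10", "r * 2")
data = list(range(20))

# Both strategies compute the same result; only the execution shape differs.
print(volcano_style(data), fused(data))
```

The fused version avoids handing each row from one operator to the next, which is the overhead whole-stage code generation is meant to remove.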

22 Major Features in 2.0
  Project Tungsten: 5-10x speedups for structured APIs
  Structured Streaming: higher-level streaming engine
  Machine Learning Model Export

23 Structured Streaming
  High-level streaming API built on DataFrames
    Event time, windowing, sessions, sources & sinks
  Also supports interactive & batch queries
    Aggregate data in a stream, then serve using JDBC
    Change queries at runtime
    Build and apply ML models
  Not just streaming, but continuous applications

24 Structured Streaming API
  Spark 1.X: Static DataFrames
  Spark 2.0: Infinite DataFrames
  Single API!

25 Example: Batch Aggregation

    logs = ctx.read.format("json").load("s3://logs")

    logs.groupBy("userid", "hour").avg("latency")
        .write.format("jdbc")
        .save("jdbc:mysql//...")

26 Example: Continuous Aggregation

    logs = ctx.read.format("json").stream("s3://logs")

    logs.groupBy("userid", "hour").avg("latency")
        .write.format("jdbc")
        .startStream("jdbc:mysql//...")

27 Incremental Execution
  Batch: Scan Files -> Aggregate -> Write to MySQL
  Continuous: Scan New Files -> Stateful Aggregate -> Update MySQL
  Batch plan is automatically transformed into the continuous plan by the Spark engine
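
The step from "Aggregate" to "Stateful Aggregate" can be sketched in plain Python. This is an illustrative model only, with a hypothetical `RunningAverage` class and made-up records keyed by (userid, hour); it keeps a (sum, count) pair per key so each new batch of files updates the averages without rescanning old data.

```python
from collections import defaultdict


class RunningAverage:
    """Keeps (sum, count) per key so averages update batch by batch."""

    def __init__(self):
        self.state = defaultdict(lambda: [0.0, 0])

    def update(self, records):
        """Fold new (key, latency) records into the state, return averages."""
        for key, latency in records:
            entry = self.state[key]
            entry[0] += latency
            entry[1] += 1
        return self.result()

    def result(self):
        return {k: s / n for k, (s, n) in self.state.items()}


# Records are ((userid, hour), latency); the values are made up.
batch1 = [(("u1", 9), 100.0), (("u2", 9), 50.0)]
batch2 = [(("u1", 9), 200.0)]

# Continuous: process only the new records each time.
agg = RunningAverage()
agg.update(batch1)
incremental = agg.update(batch2)

# Batch: process everything at once.
full = RunningAverage().update(batch1 + batch2)
print(incremental == full)
```

Both paths give the same answer, which is the property that lets the same query definition serve batch and continuous execution.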

28 Major Features in 2.0
  Project Tungsten: 5-10x speedups for structured APIs
  Structured Streaming: higher-level streaming engine
  Machine Learning Model Export

29 ML Model Export
  "I trained a great ML model, but how can I call it in production?"
  Model export allows saving & loading entire ML pipelines (including feature transformation steps)
  tinyurl.com/ml-persistence, SPARK-6725
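
The idea of persisting a whole pipeline, feature transformation included, can be sketched with the standard library. This is not Spark's persistence format or API; the stage dictionaries, the `save_pipeline` / `load_pipeline` / `predict` helpers, and the parameter values are all assumptions for illustration.

```python
import json
import os
import tempfile


def save_pipeline(stages, path):
    """Write every stage, including feature transforms, to one file."""
    with open(path, "w") as f:
        json.dump(stages, f)


def load_pipeline(path):
    with open(path) as f:
        return json.load(f)


def predict(stages, x):
    """Apply each stage in order: scale the feature, then score it."""
    for stage in stages:
        if stage["type"] == "scaler":
            x = (x - stage["mean"]) / stage["std"]
        elif stage["type"] == "linear":
            x = stage["weight"] * x + stage["bias"]
    return x


# A fitted pipeline: feature scaling followed by a linear model (made-up values).
pipeline = [
    {"type": "scaler", "mean": 10.0, "std": 2.0},
    {"type": "linear", "weight": 3.0, "bias": 1.0},
]

path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
save_pipeline(pipeline, path)
restored = load_pipeline(path)

# The restored pipeline scores new inputs exactly like the original.
print(predict(restored, 14.0))
```

Saving the scaler together with the model matters: a production service that loaded only the model would score unscaled inputs and give different predictions.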

30 Conclusion
  Apache Spark 2.0 continues the goal of a unified, high-level API for big data
  Part of a great ecosystem with Apache Mesos, Cassandra, Kafka, ...
  Try the {unfinished, unstable} 2.0 preview release: spark.apache.org

31 Want to Learn Apache Spark?
  Databricks Community Edition offers free hands-on tutorials
  databricks.com/ce
