Apache Spark. Easy and Fast Big Data Analytics Pat McDonough

1 Apache Spark Easy and Fast Big Data Analytics Pat McDonough

2 Founded by the creators of Apache Spark out of UC Berkeley's AMPLab; fully committed to 100% open source Apache Spark; supporting and growing the Spark community and ecosystem; building Databricks Cloud

3 Databricks & DataStax
- Apache Spark is packaged as part of DataStax Enterprise Analytics 4.5
- Databricks & DataStax have partnered for Apache Spark engineering and support

4 Big Data Analytics: Where We've Been
- 2003 & 2004: Google GFS & MapReduce papers are precursors to Hadoop
- 2006 & 2007: Google BigTable and Amazon Dynamo papers are precursors to Cassandra, HBase, and others

5 Big Data Analytics A Zoo of Innovation

9 What's Working?
Many Excellent Innovations Have Come From Big Data Analytics:
- Distributed & Data Parallel is disruptive... because we needed it
- We Now Have Massive Throughput
- Solved the ETL Problem
- The Data Hub/Lake Is Possible

10 What Needs to Improve? Go Beyond MapReduce
MapReduce is a Very Powerful and Flexible Engine:
- Processing Throughput Previously Unobtainable on Commodity Equipment
But MapReduce Isn't Enough:
- Essentially Batch-only
- Inefficient with respect to memory use, latency
- Too Hard to Program

11 What Needs to Improve? Go Beyond (S)QL
SQL Support Has Been A Welcome Interface on Many Platforms
- And in many cases, a faster alternative
But SQL Is Often Not Enough:
- Sometimes you want to write real programs (loops, variables, functions, existing libraries) but don't want to build UDFs
- Machine Learning (see above, plus iterative)
- Multi-step pipelines
- Often an Additional System

12 What Needs to Improve? Ease of Use
Big Data Distributions Provide a Number of Useful Tools and Systems
- Choices are Good to Have
But This Is Often Unsatisfactory:
- Each new system has its own configs, APIs, and management; coordination of multiple systems is challenging
- A typical solution requires stringing together disparate systems: we need unification
- Developers want the full power of their programming language

13 What Needs to Improve? Latency
Big Data systems are throughput-oriented
- Some new SQL systems provide interactivity
But We Need More:
- Interactivity beyond SQL interfaces
- Repeated access of the same datasets (i.e. caching)

14 Can Spark Solve These Problems?

15 Apache Spark
- Originally developed in 2009 in UC Berkeley's AMPLab
- Fully open sourced in 2010, now at the Apache Software Foundation

16-17 Project Activity
[Chart comparing June 2013 and June 2014: total contributors, companies contributing, total lines of code]

18-19 Compared to Other Projects
Spark is now the most active project in the Hadoop ecosystem
[Chart: activity in past 6 months, by commits and by lines of code changed]

20 Spark on GitHub
- So active on GitHub, sometimes we break it
- Over 1200 forks (can't display Network Graphs)
- ~80 commits to master each week
- So many PRs, we built our own PR UI

21-22 Apache Spark - Easy to Use And Very Fast
Fast and general cluster computing system interoperable with Big Data systems like Hadoop and Cassandra
Improved Efficiency:
- In-memory computing primitives
- General computation graphs
- Up to 100× faster (2-10× on disk)
Improved Usability:
- Rich APIs
- Interactive shell
- 2-5× less code

23 Apache Spark - A Robust SDK for Big Data Applications
[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]
- Unified system with libraries to build a complete solution
- Very developer-friendly, functional API for working with data
- Full-featured programming environment in Scala, Java, Python
- Runtimes available on several platforms

24 Spark Is A Part Of Most Big Data Platforms
[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core; caption: Deploy Spark Apps Anywhere]
- All major Hadoop distributions include Spark
- Spark is also integrated with non-Hadoop Big Data platforms like DSE
- Spark applications can be written once and deployed anywhere

25 Cassandra + Spark: A Great Combination
- Both are easy to use
- Spark can help you bridge your Hadoop and Cassandra systems
- Use Spark libraries and caching on top of Cassandra-stored data
- Combine Spark Streaming with Cassandra storage
- DataStax spark-cassandra-connector

26 Easy: Get Started Immediately
- Interactive shell
- Multi-language support

Python

    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala

    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

Java

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("error"); }
    }).count();

27 Easy: Clean API
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets:
- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure
Operations:
- Transformations (e.g. map, filter, groupBy)
- Actions (e.g. count, collect, save)
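The split between transformations and actions can be illustrated without a cluster. The following is a minimal single-machine sketch in plain Python, not the Spark API: `MiniRDD`, `parallelize`, and `tag` are hypothetical names introduced only for illustration. Transformations return a new dataset without doing any work, and only an action runs the chain.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source):
        # `source` is a zero-argument function returning a fresh iterator.
        self._source = source

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: build a new MiniRDD, do no work yet.
    def map(self, func):
        return MiniRDD(lambda: (func(x) for x in self._source()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole chain and return a plain value.
    def count(self):
        return sum(1 for _ in self._source())

    def collect(self):
        return list(self._source())


calls = []


def tag(line):
    calls.append(line)  # records when the function is really invoked
    return line.upper()


lines = MiniRDD.parallelize(["error: disk", "info: ok", "error: net"])
errors = lines.filter(lambda s: s.startswith("error")).map(tag)
assert calls == []  # nothing has run yet: transformations are lazy

print(errors.count())    # the action triggers evaluation
print(errors.collect())
```

Each action here re-runs the chain from the source, which is the behaviour that caching is meant to avoid.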

28-29 Easy: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

30-31 Easy: Example Word Count

Hadoop MapReduce

    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark

    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
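To make the three steps of the Spark word count concrete, here is a small plain-Python sketch of the same flatMap, map, reduceByKey pipeline. It runs on one machine and does not use Spark; `flat_map` and `reduce_by_key` are hypothetical helpers written for this illustration, and the input lines are made up.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the resulting sequences."""
    for item in items:
        for out in func(item):
            yield out


def reduce_by_key(func, pairs):
    """Combine all values that share a key using the binary function func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


lines = ["to be or", "not to be"]                    # stands in for the text file
words = flat_map(lambda line: line.split(" "), lines)
pairs = ((word, 1) for word in words)                # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)    # sum the 1s per word
print(counts)
```

In Spark the same combine function is applied within each partition before results are merged across the cluster, which is why it should be associative.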

32 Easy: Works Well With Hadoop
Data Compatibility:
- Access your existing Hadoop data
- Use the same data formats
- Adheres to data locality for efficient processing
Deployment Models:
- Standalone deployment
- YARN-based deployment
- Mesos-based deployment
- Deploy on existing Hadoop cluster or side-by-side

33 Example: Logistic Regression

    data = spark.textFile(...).map(readPoint).cache()

    w = numpy.random.rand(D)

    for i in range(iterations):
        # gradient of the logistic loss, summed over all points
        gradient = data.map(
            lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient

    print("Final w: %s" % w)
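The map-then-reduce shape of each iteration can be tried out locally. The sketch below is plain Python with the standard library only, not Spark or NumPy. It assumes labels of +1/-1 and uses the logistic-loss gradient (1 / (1 + exp(-y w·x)) - 1) · y · x per point; the helper names and the four sample points are hypothetical.

```python
import functools
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def train(points, dims, iterations, seed=0):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dims)]
    for _ in range(iterations):
        # "map": one gradient per point; "reduce": element-wise sum
        grads = [point_gradient(w, x, y) for x, y in points]
        gradient = functools.reduce(add, grads)
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w


def predict(w, x):
    return 1 if dot(w, x) > 0 else -1


# Tiny, linearly separable toy data set: (features, label)
points = [([1.0, 2.0], 1), ([2.0, 1.5], 1),
          ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w = train(points, dims=2, iterations=50)
print(w, [predict(w, x) for x, _ in points])
```

The data set is scanned once per iteration, which is why keeping it in memory matters for this kind of workload.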

34 Fast: Using RAM, Operator Graphs
- In-memory caching: data partitions read from RAM instead of disk
- Operator graphs: scheduling optimizations, fault tolerance
[Diagram: RDDs A-F, with cached partitions marked, connected by groupBy, map, filter, and join and divided into Stages 1-3]

35 Fast: Logistic Regression Performance
[Chart: running time (s) vs. number of iterations, Hadoop vs. Spark; the seconds-per-iteration figure for Hadoop is not preserved]
- Spark: first iteration 80 s, further iterations 1 s

36 Fast: Scales Down Seamlessly
[Chart: execution time (s) vs. % of working set in cache, for cache disabled, 25%, 50%, 75%, and fully cached]

37 Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data

    msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

[Diagram: HDFS File -> filter (func = startswith(...)) -> Filtered RDD -> map (func = split(...)) -> Mapped RDD]
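The idea of recomputing lost data from lineage can be sketched in a few lines of plain Python. This is a toy model under simplifying assumptions (one machine, whole datasets rather than partitions), not Spark's implementation; `LineageRDD` and its methods are hypothetical names, and the log lines are made up.

```python
class LineageRDD:
    """Toy dataset that remembers how it was derived (hypothetical, not Spark)."""

    def __init__(self, parent=None, op=None, func=None, source=None):
        self.parent, self.op, self.func, self.source = parent, op, func, source
        self.data = None  # materialized result; may be lost at any time

    def filter(self, func):
        return LineageRDD(parent=self, op="filter", func=func)

    def map(self, func):
        return LineageRDD(parent=self, op="map", func=func)

    def compute(self):
        """Return the data, rebuilding it from lineage if it is missing."""
        if self.data is None:
            if self.parent is None:
                self.data = list(self.source)  # re-read the stable input
            elif self.op == "filter":
                self.data = [x for x in self.parent.compute() if self.func(x)]
            else:  # "map"
                self.data = [self.func(x) for x in self.parent.compute()]
        return self.data

    def lineage(self):
        chain, node = [], self
        while node is not None:
            chain.append(node.op or "source")
            node = node.parent
        return list(reversed(chain))


log = ["ERROR\tdb\tmysql timeout", "INFO\tweb\tstarted", "ERROR\tweb\tphp crash"]
text_file = LineageRDD(source=log)
msgs = text_file.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

first = msgs.compute()
msgs.data = None           # simulate losing the in-memory results
msgs.parent.data = None
recovered = msgs.compute()  # rebuilt by replaying filter and map
print(msgs.lineage(), recovered)
```

Only the recipe (source, filter, map) has to survive a failure; the data itself can always be derived again from the stable input.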

38 How Spark Works

39-42 Working With RDDs

    textFile = sc.textFile("SomeFile.txt")

    # Transformations: RDD -> RDD
    linesWithSpark = textFile.filter(lambda line: "Spark" in line)

    # Actions: RDD -> value
    linesWithSpark.count()   # 74
    linesWithSpark.first()   # "# Apache Spark"

43-59 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "mysql" in s).count()   # action
    messages.filter(lambda s: "php" in s).count()

Execution, as built up across the slides:
- The first action makes the driver send tasks to the workers holding Block 1, Block 2, and Block 3
- Each worker reads its HDFS block, processes and caches the data (Cache 1, Cache 2, Cache 3), and returns results to the driver
- For the second query the driver sends tasks again; each worker processes from its cache and returns results

Cache your data, get faster results:
- Full-text search of Wikipedia, 60GB on 20 EC2 machines
- 0.5 sec from cache vs. 20s for on-disk
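The effect of caching on repeated queries can be demonstrated with a small plain-Python sketch. It is a single-machine toy, not Spark: `CachedDataset`, `count_matching`, and `load_messages` are hypothetical names, and the log lines are made up. The `loads` counter shows that the expensive load runs only once when the dataset is marked for caching.

```python
class CachedDataset:
    """Toy dataset with an optional in-memory cache (hypothetical, not Spark)."""

    def __init__(self, loader):
        self._loader = loader     # function that produces the records
        self._use_cache = False
        self._cached = None
        self.loads = 0            # how many times the loader actually ran

    def cache(self):
        self._use_cache = True    # only marks the dataset; nothing is loaded yet
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        self.loads += 1
        records = list(self._loader())
        if self._use_cache:
            self._cached = records
        return records

    def count_matching(self, word):
        return sum(1 for r in self._records() if word in r)


def load_messages():
    # Stands in for reading the log, keeping ERROR lines, and taking field 2.
    raw = ["ERROR\tdb\tmysql timeout", "INFO\tweb\tstarted",
           "ERROR\tweb\tphp crash", "ERROR\tdb\tmysql refused"]
    errors = (s for s in raw if s.startswith("ERROR"))
    return (s.split("\t")[2] for s in errors)


messages = CachedDataset(load_messages).cache()
mysql_hits = messages.count_matching("mysql")  # first action: loads and caches
php_hits = messages.count_matching("php")      # second action: served from cache
print(mysql_hits, php_hits, messages.loads)
```

Without the call to `cache()`, the loader would run once per query, which corresponds to re-reading the blocks from disk for every search.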

60 Spark's Libraries
[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

61 Spark SQL

62 What is Spark SQL?
Out of the box APIs built on the same system:
- SQL interfaces, SchemaRDDs, and a LINQ-like DSL for end users
- An optimizer framework for manipulating trees of relational operators
- Native support for executing relational queries (SQL) in Spark
- Optimized integration with external sources

63 SparkSQL Architecture

64 Relationship to Shark
Catalyst/SparkSQL is a nearly from-scratch rewrite that leverages the best parts of Shark
Borrows:
- Hive data loading code / in-memory columnar representation
- Hardened Spark execution engine
Adds:
- RDD-aware optimizer / query planner
- Execution engine
- Language interfaces

65 Hive Compatibility
Interfaces to access data and code in the Hive ecosystem:
- Support for writing queries in HQL
- Catalog that interfaces with the Hive MetaStore
- Tablescan operator that uses Hive SerDes
- Wrappers for Hive UDFs, UDAFs, UDTFs

66 Parquet Support
Native support for reading data stored in Parquet:
- Columnar storage avoids reading unneeded data
- Nested data support
- RDDs can be written to Parquet files, preserving the schema
- Predicate push-down support

67 JSON Support
Native support for reading data stored in JSON:
- Schema inference through sampling
- Nested data support
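As a rough illustration of what schema inference through sampling over nested JSON involves, here is a plain-Python sketch using only the standard library. It is not Spark SQL's algorithm: `infer_schema`, `merge_record`, the `sample_fraction` parameter, and the rule that the first type seen for a field wins are all assumptions made for this example.

```python
import json
import random


def merge_record(schema, record):
    """Fold one parsed JSON object into the schema, recursing into nested objects."""
    for key, value in record.items():
        if isinstance(value, dict):
            sub = schema.setdefault(key, {})
            if isinstance(sub, dict):
                merge_record(sub, value)
        else:
            # Simplification: the first type seen for a field wins.
            schema.setdefault(key, type(value).__name__)


def infer_schema(json_lines, sample_fraction=1.0, seed=0):
    """Infer a field -> type mapping from a random sample of JSON lines."""
    rng = random.Random(seed)
    schema = {}
    for line in json_lines:
        if rng.random() >= sample_fraction:
            continue  # record not in the sample
        merge_record(schema, json.loads(line))
    return schema


json_lines = [
    '{"name": "a", "age": 30}',
    '{"name": "b", "loc": {"lat": 1.5, "lon": 2.5}}',
]
schema = infer_schema(json_lines)
print(schema)
```

With a `sample_fraction` below 1.0, fields that appear only in unsampled records are missed, which is the usual trade-off between inference cost and schema completeness.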

68 Built-in Driver
- JDBC available out of the box (OOTB) as of Spark 1.1

69 Optimizations
In addition to the standard Spark framework's optimizations:
- Predicate push-down
- Partition pruning
- Code gen
- Automatic broadcasts (based on statistics)

70 Example: SparkSQL, Core APIs, and MLlib Working Together

    val trainingDataTable = sql("""
      SELECT e.action, u.age, u.latitude, u.longitude
      FROM Users u
      JOIN Events e ON u.userId = e.userId""")

    // Since `sql` returns an RDD, the results can be easily used in MLlib
    val trainingData = trainingDataTable.map { row =>
      val features = Array[Double](row(1), row(2), row(3))
      LabeledPoint(row(0), features)
    }

    val model = new LogisticRegressionWithSGD().run(trainingData)

71 Recent Roadmap Updates: Performance and Usability Improvements
- Disk spilling for skewed blocks during cache operations
- Disk spilling during aggregations for PySpark
- Sort-based shuffle
- Usability improvements for monitoring the performance of long-running or complex jobs

72 Recent Roadmap Updates: SparkSQL
- JDBC/ODBC server built in
- Support for loading JSON data directly into Spark's SchemaRDD format, including automatic schema inference
- Dynamic bytecode generation, significantly speeding up execution for queries that perform complex expression evaluation
- Support for registering Python, Scala, and Java lambda functions as UDFs
- Spark 1.1 adds a public types API to allow users to create SchemaRDDs from custom data sources
- Many, many optimizations (Parquet-specific, cost-based, ...)

73 Recent Roadmap Updates: MLlib
- New library of statistical packages which provides exploratory analytic functions (stratified sampling, correlations, chi-squared tests, creating random datasets)
- Utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling)
- Support for nonnegative matrix factorization and SVD via Lanczos
- Decision tree algorithm has been added in Python and Java
- Tree aggregation primitive
- Performance improves across the board, with improvements of around ...

74 Recent Roadmap Updates: Spark Streaming
- New data source for Amazon Kinesis
- Apache Flume: a new pull-based mode (simplifying deployment and providing high availability)
- The first of a set of streaming machine learning algorithms is introduced with streaming linear regression
- Rate limiting has been added for streaming inputs

75 Thank You! Visit Blogs, Tutorials and more! Questions?
