Lightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast



2 What is Apache Spark?

3 What is Apache Spark?
Fast and general computing engine for clusters, created by students at UC Berkeley
Makes it easy to process large (GB-PB) datasets
Support for Java, Scala, Python, R
Libraries for SQL, streaming, machine learning, ...
Up to 100x faster than Hadoop Map/Reduce for some applications

4 Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
> Collections of objects that can be stored in memory or on disk across a cluster
> Parallel functional transformations (map, filter, ...)
> Automatically rebuilt on failure


5 Example: Log Mining
Load messages from a log file into memory, then interactively search for the problem

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda x: x.startswith("ERROR"))
    messages = errors.map(lambda x: x.split("\t")[2])
    messages.cache()

    messages.filter(lambda x: "foo" in x).count()
    messages.filter(lambda x: "bar" in x).count()
    ...

[Diagram labels: Base RDD, Transformed RDD, Action. The Driver sends tasks to three Workers and receives results; each Worker holds one Block of the file and one Cache.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
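To make the data flow of the log-mining example concrete without a cluster, here is a minimal plain-Python analogue. It is only a sketch: the sample log lines and the helper name `count_matching` are made up for illustration, and a Python list stands in for the cached RDD.

```python
# Plain-Python analogue of the log-mining pipeline; no Spark required.
# The sample log lines below are invented for illustration.
sample_lines = [
    "ERROR\t2015-10-01\tfoo service timed out",
    "INFO\t2015-10-01\tfoo service started",
    "ERROR\t2015-10-02\tbar disk full",
    "ERROR\t2015-10-03\tfoo and bar both failed",
]

# filter step, mirroring lines.filter(lambda x: x.startswith("ERROR"))
errors = (line for line in sample_lines if line.startswith("ERROR"))

# map step, mirroring errors.map(lambda x: x.split("\t")[2]).
# Materialising the list plays the role of messages.cache():
# later queries reuse it instead of re-reading the input.
messages = [line.split("\t")[2] for line in errors]


def count_matching(pattern):
    """Count cached messages containing `pattern` (the filter + count step)."""
    return sum(1 for m in messages if pattern in m)


foo_count = count_matching("foo")
bar_count = count_matching("bar")
print(foo_count, bar_count)
```

Each interactive query only scans the in-memory `messages` list, which is the property the slide attributes to caching.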

6 Fault Tolerance
RDDs track lineage info to rebuild lost data

    file.map(lambda rec: (rec.type, 1)) \
        .reduceByKey(lambda x, y: x + y) \
        .filter(lambda pair: pair[1] > 10)   # pair is (type, count)

[Diagram: Input file -> map -> reduce -> filter]
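The idea of rebuilding lost data from lineage can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how Spark implements RDDs: the class `LineageDataset` and its methods are hypothetical names, and everything runs in one process.

```python
class LineageDataset:
    """Toy dataset that remembers how to recompute itself from its parent."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # only set on the root dataset
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function: list -> list
        self.cached = None          # materialised data, may be lost

    def map(self, fn):
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def reduce_by_key(self, fn):
        def reduce_rows(rows):
            acc = {}
            for key, value in rows:
                acc[key] = fn(acc[key], value) if key in acc else value
            return list(acc.items())
        return LineageDataset(parent=self, transform=reduce_rows)

    def collect(self):
        # Recompute from the parent only if the data is not materialised.
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.transform(self.parent.collect())
        return self.cached

    def lose_data(self):
        # Simulate a failure: drop materialised data along the whole chain.
        node = self
        while node is not None:
            node.cached = None
            node = node.parent


records = ["a", "b", "a", "a", "b", "c"]  # invented record types
base = LineageDataset(source=records)
counts = (base.map(lambda rec: (rec, 1))
              .reduce_by_key(lambda x, y: x + y)
              .filter(lambda pair: pair[1] > 1))

first = counts.collect()
counts.lose_data()          # lose every materialised result
rebuilt = counts.collect()  # recomputed by replaying the lineage
```

Because each dataset stores its parent and the function that derived it, nothing needs to be replicated up front; the lost result is recomputed from the input on demand.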


8 Speed-up ML Using Memory
[Chart: running time (s) vs. number of iterations, Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s

9 On-Disk Sort Record: Time to sort 100TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Also sorted 1PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org

10 Higher-Level Libraries
Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), all built on Spark

11 Seamlessly switch components

    # Load data using SQL
    points = ctx.sql("select latitude, longitude from tweets")
    # Train a machine learning model
    model = KMeans.train(points, 10)
    # Apply it to a stream
    sc.twitterStream(...) \
        .map(lambda t: (model.predict(t.location), 1)) \
        .reduceByWindow("5s", lambda a, b: a + b)

12-16 Powerful Stack, Agile Development
[Bar chart: non-test, non-example source lines (axis 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark. Across slides 13-16 the Spark bar is extended with Streaming, SparkSQL, GraphX, and "Your App?".]

17 Open Source Ecosystem
[Diagram groups: Applications, Environments, Data Sources]

18 Spark Community
Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org

20 Get Involved on [...]
Check us out at [...]
Contribute code through [...]
Best way to get started is to fix a bug
Don't forget to write a test!

21 About Databricks
Founded by the creators of Spark and remains the largest contributor.
The hardest part of using Spark is managing 100s of machines. Databricks makes this easy.

22 Demo
Using Spark to analyze emoji use on Twitter

23 What's next for Spark?

24 Spark + declarative programming
Create and Run Spark Programs Faster:
Write less code
Read less data
Let the optimizer do the hard work

25 DataFrame
noun [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).


26 Write Less Code: Compute an Average

Hadoop MapReduce (Java), mapper:

    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Text has no split(); convert to String first.
        String[] fields = value.toString().split("\t");
        output.set(Integer.parseInt(fields[1]));
        context.write(one, output);
    }

Hadoop MapReduce (Java), reducer:

    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable value : values) {
            sum += value.get();
            count++;
        }
        average.set(sum / (double) count);
        context.write(key, average);
    }

Spark (Python):

    # Split each line into fields, then carry [sum, count] per key.
    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

27 Write Less Code: Compute an Average

Using RDDs

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

Using SQL

    SELECT name, avg(age) FROM people GROUP BY name

Using DataFrames

    # The grouping column "name" is kept in the result alongside the average.
    sqlctx.table("people") \
        .groupBy("name") \
        .agg(avg("age")) \
        .collect()
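The RDD version works by carrying a [sum, count] pair per key and dividing at the end. The same three steps can be traced in plain Python; this is a sketch with invented rows, and a dictionary stands in for the shuffle that reduceByKey performs.

```python
# Made-up input rows: [name, age-as-string], as if split from tab-separated text.
rows = [["alice", "30"], ["bob", "25"], ["alice", "40"]]

# map step: key -> [value, 1]
pairs = [(name, [int(age), 1]) for name, age in rows]

# reduceByKey step: add sums and counts element-wise for each key
totals = {}
for name, (age, one) in pairs:
    if name in totals:
        totals[name] = [totals[name][0] + age, totals[name][1] + one]
    else:
        totals[name] = [age, one]

# final map step: sum / count per key
averages = {name: s / c for name, (s, c) in totals.items()}
print(averages)
```

The SQL and DataFrame versions express only the last line's intent (average of age grouped by name) and leave the sum-and-count bookkeeping to the engine.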

28 Not Just Less Code: Faster Implementations
[Bar chart: time to aggregate 10 million int pairs (secs) for DataFrame SQL, DataFrame Python, DataFrame Scala, RDD Python, RDD Scala]

29 Machine Learning Pipelines

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingtf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingtf, lr])

    df = sqlctx.load("/path/to/data")
    model = pipeline.fit(df)

[Diagram: df0 -> tokenizer -> df1 -> hashingtf -> df2 -> lr.model -> df3, with lr fitted to produce lr.model; together these stages form the Pipeline Model.]

30 Optimization happens as late as possible; therefore, Spark SQL can optimize across functions.

31

    def add_demographics(events):
        u = sqlctx.table("users")                        # Load Hive table
        return (events
                .join(u, events.user_id == u.user_id)    # Join on user_id
                .withColumn("city", ziptocity(u.zip)))   # udf adds city column

    events = add_demographics(sqlctx.load("/data/events", "json"))
    training_data = events.where(events.city == "Champaign").select(events.timestamp).collect()

[Diagram. Logical Plan: filter on top of a join of the events file and the users table. Physical Plan: join of scan (events) with a filter over scan (users). Labels: "expensive", "only join relevant users".]

32

    def add_demographics(events):
        u = sqlctx.table("users")                        # Load partitioned Hive table
        return (events
                .join(u, events.user_id == u.user_id)    # Join on user_id
                .withColumn("city", ziptocity(u.zip)))   # Run udf to add city column

    events = add_demographics(sqlctx.load("/data/events", "parquet"))
    training_data = events.where(events.city == "Champaign").select(events.timestamp).collect()

[Diagram. Logical Plan: filter on top of a join of the events file and the users table. Physical Plan: join of scan (events) with a filter over scan (users). Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) with optimized scan (users).]


33 Plan Optimization & Execution
[Diagram: SQL AST / DataFrame -> Unresolved Logical Plan -> Analysis (using the Catalog) -> Logical Plan -> Logical Optimization -> Optimized Logical Plan -> Physical Planning -> Physical Plans -> Cost Model -> Selected Physical Plan -> Code Generation -> RDDs]
DataFrames and SQL share the same optimization/execution pipeline

34 Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.
Original Plan (root first): Project name -> Filter id = 1 -> Project id,name -> People
Filter Push-Down (root first): Project name -> Project id,name -> Filter id = 1 -> People
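The three steps of the rule can be sketched on a tiny plan tree in plain Python. This is an illustrative model only, not Catalyst: the node classes `Table`, `Project`, `Filter` and the function `push_filter_down` are hypothetical, and the filter is restricted to a single `column == value` predicate.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Table:
    name: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Filter:
    column: str   # column referenced by the predicate
    value: int    # predicate is `column == value`
    child: object


def output(plan):
    """Columns produced by a plan node (a Filter passes its child's columns)."""
    if isinstance(plan, Filter):
        return output(plan.child)
    return plan.columns


def push_filter_down(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) where it is safe."""
    if isinstance(plan, Table):
        return plan
    # Rewrite the subtree first, then look at this node.
    plan = replace(plan, child=push_filter_down(plan.child))
    # Step 1: find a filter on top of a projection.
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        grandchild = project.child
        # Step 2: the filter must only need columns the grandchild produces.
        if plan.column in output(grandchild):
            # Step 3: switch the operators (and keep pushing if possible).
            pushed = push_filter_down(replace(plan, child=grandchild))
            return replace(project, child=pushed)
    return plan


people = Table("People", ("id", "name"))
original = Project(("name",), Filter("id", 1, Project(("id", "name"), people)))
pushed_down = push_filter_down(original)
print(pushed_down)
```

With the plans from the slide, `original` is the Original Plan and `pushed_down` has the Filter directly above People, matching the Filter Push-Down plan.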

35 Prior Work: Optimizer Generators
Volcano / Cascades:
> Create a custom language for expressing rules that rewrite trees of relational operators.
> Build a compiler that generates executable code for these rules.
Cons: Developers need to learn this custom language. Language might not be powerful enough.

36-43 Filter Push Down Transformation

    val newplan = queryplan transform {
      // f and p bind the matched Filter and Project nodes.
      case f @ Filter(_, p @ Project(_, grandchild))
          if (f.references subsetOf grandchild.output) =>
        p.copy(child = f.copy(child = grandchild))
    }

Annotations shown across slides 37-43:
queryplan is a Tree; the block passed to transform is a Partial Function.
The case pattern finds a Filter on a Project (Scala: Pattern Matching).
The guard checks that the filter can be evaluated without the result of the project (Catalyst: Attribute Reference Tracking).
If so, the copy calls switch the order (Scala: Copy Constructors).

44 Optimizing with Rules
Original Plan (root first): Project name -> Filter id = 1 -> Project id,name -> People
Filter Push-Down: Project name -> Project id,name -> Filter id = 1 -> People
Combine Projection: Project name -> Filter id = 1 -> People
Physical Plan: IndexLookup id = 1, return: name
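The first three plans on this slide can be reproduced by applying two small rules repeatedly until the plan stops changing. The sketch below is a plain-Python illustration under stated assumptions, not Catalyst's implementation: all class and function names are hypothetical, and the physical-planning step (IndexLookup) is not modelled.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Table:
    name: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Filter:
    column: str   # predicate is `column == value`
    value: int
    child: object


def output(plan):
    return output(plan.child) if isinstance(plan, Filter) else plan.columns


def transform(plan, rule):
    """Apply `rule` to every node of the tree, children first."""
    if not isinstance(plan, Table):
        plan = replace(plan, child=transform(plan.child, rule))
    return rule(plan)


def push_filter_below_project(node):
    if isinstance(node, Filter) and isinstance(node.child, Project):
        project = node.child
        if node.column in output(project.child):
            return replace(project, child=replace(node, child=project.child))
    return node


def combine_projections(node):
    # Two stacked projections collapse when the outer one needs
    # only columns that the inner one already passes through.
    if isinstance(node, Project) and isinstance(node.child, Project):
        if set(node.columns) <= set(node.child.columns):
            return replace(node, child=node.child.child)
    return node


def optimize(plan, rules):
    """Run all rules repeatedly until the plan stops changing."""
    while True:
        new_plan = plan
        for rule in rules:
            new_plan = transform(new_plan, rule)
        if new_plan == plan:
            return plan
        plan = new_plan


people = Table("People", ("id", "name"))
original = Project(("name",), Filter("id", 1, Project(("id", "name"), people)))
optimized = optimize(original, [push_filter_below_project, combine_projections])
print(optimized)
```

Each rule is a small function on one node, so new optimizations can be added to the list without touching the traversal or the fixed-point loop.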

45 Coming Soon: Datasets
Type-safe: operate on domain objects with compiled lambda functions
Fast: code-generated encoders for fast serialization
Interoperable: easily convert DataFrames to Datasets without boilerplate

    val df = ctx.read.json("people.json")

    // Convert to custom objects.
    case class Person(name: String, age: Int)
    val ds: Dataset[Person] = df.as[Person]
    ds.filter(_.age > 30)

    // Compute histogram of age by name.
    ds.groupBy(_.name).mapGroups {
      case (name, people) =>
        // Ten zero-initialised buckets, one per decade of age.
        val buckets = new Array[Int](10)
        people.map(_.age).foreach { a =>
          buckets(a / 10) += 1
        }
        (name, buckets)
    }
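For readers more comfortable with Python, the following sketch shows what the filter and the per-name age histogram compute. It is a plain-Python analogue with invented sample people, not the Dataset API, and it assumes ages between 0 and 99 so that ten decade-wide buckets suffice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


# Invented sample data.
people = [
    Person("ann", 34),
    Person("ann", 37),
    Person("bob", 8),
    Person("ann", 61),
]

# Analogue of ds.filter(_.age > 30)
over_30 = [p for p in people if p.age > 30]

# Analogue of the groupBy + mapGroups step: one list of ten
# decade buckets per name, incremented once per person.
histograms = defaultdict(lambda: [0] * 10)
for p in people:
    histograms[p.name][p.age // 10] += 1

print(dict(histograms))
```

The type-safety point on the slide corresponds to working with `Person` objects and their fields directly rather than with untyped rows.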

46 Questions?
