Olivia Klose Technical Evangelist. Sascha Dittmann Cloud Solution Architect


1 Olivia Klose Technical Evangelist Sascha Dittmann Cloud Solution Architect

3 What is Apache Spark? Apache Spark is a fast and general engine for large-scale data processing: a unified, open source, parallel data processing framework for Big Data analytics.

4 What is Apache Spark? Speed Ease of Use Generality Runs Everywhere

5 What is Apache Spark? Speed Ease of Use Generality Runs Everywhere [Chart: Hadoop vs. Spark (0.9). Logistic regression on a 100-node cluster with 100 GB of data.]

6 What is Apache Spark? Speed Ease of Use Generality Runs Everywhere

    text_file = spark.textFile("hdfs://...")
    text_file.flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

Word count in Spark's Python API
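The three steps of the word count above can be traced on a single machine without a cluster. What follows is a minimal plain-Python sketch, not Spark code: the sample lines and the helper `reduce_by_key` are made up for illustration, and the sketch only mimics what flatMap, map and reduceByKey do to the data.

```python
from itertools import chain


def reduce_by_key(pairs, func):
    """Merge the values of equal keys with func (hypothetical helper)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical input standing in for the lines of a text file.
lines = ["to be or not", "to be"]

# flatMap step: each line yields several words, flattened into one sequence.
words = list(chain.from_iterable(line.split() for line in lines))

# map step: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey step: add up the counts of identical words.
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(counts)
```

In Spark the same merge function runs in parallel on every partition before the partial results are combined, which is why it has to be associative.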

7 What is Apache Spark? Speed Ease of Use Generality Runs Everywhere Libraries on Spark Core: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), SparkR

8 What is Apache Spark? Speed Ease of Use Generality Runs Everywhere Libraries on Spark Core: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), SparkR. Cluster managers: YARN, Mesos, Standalone

9 Apache Spark in the Community

11 Scenarios: Stream Processing, Machine Learning, Interactive Analytics, Data Integration

12 Unifying Data Sources Traditional Data Warehouse (ETL & Query): Data Sources A, B, C; ETL; Data Warehouse. Just-in-Time Data Warehouse (Stream/Cache & Query): Data Sources A, B, C; RAM

13 Unifying Data Sources Traditional Data Warehouse Download & Play Just-in-Time Data Warehouse Stream/Cache & Play

14 Unifying Data Processing First cellular phones Specialized devices Smartphone (Unified Device)

15 Unifying Data Processing First cellular phones Specialized devices Smartphone (Unified Device) Better Games Better Phone Better GPS

16 Unifying Data Processing Spark is the smartphone of Big Data Batch Processing Specialized Systems Unified System

17 Unifying Data Processing Spark is the smartphone of Big Data Unified System Real-time analytics Instant fraud detection Better Apps

20 Spark Stack Libraries: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), SparkR (R on Spark). Engine: Spark Core. Cluster managers: YARN, Mesos, Standalone

21 Storage Options Azure Data Lake

22 Resilient Distributed Datasets (RDDs) Transformations turn an RDD into another RDD (RDD → RDD → RDD); actions turn an RDD into a value.
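The split between transformations and actions matters because Spark evaluates transformations lazily: they only describe a new dataset, and nothing is computed until an action asks for a value. The following plain-Python sketch illustrates that behaviour under stated assumptions. `LazyDataset` is a hypothetical stand-in written for this example, not part of any Spark API, and it runs on one machine.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of (kind, function) pairs

    # Transformations: return a new dataset description, compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    # Actions: run the recorded steps and return a plain value.
    def collect(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)

    def count(self):
        return len(self.collect())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
squares = numbers.map(square)                        # transformation
even_squares = squares.filter(lambda x: x % 2 == 0)  # transformation
calls_before_action = len(calls)                     # nothing has run yet
result = even_squares.collect()                      # action: pipeline runs now
print(calls_before_action, result)
```

Each transformation returns a new object and leaves its parent untouched, mirroring the fact that RDDs are immutable.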

25 Hadoop vs. Spark: Compute an Average

Hadoop (Java), mapper:

    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is tab-separated; the second field is the number to average.
        String[] fields = value.toString().split("\t");
        output.set(Integer.parseInt(fields[1]));
        context.write(one, output);
    }

Hadoop (Java), reducer:

    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable value : values) {
            sum += value.get();
            count++;
        }
        average.set(sum / (double) count);
        context.write(key, average);
    }

Spark (Python):

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()
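Both versions above carry a running sum and a count rather than an average, because partial averages cannot simply be averaged again when the partitions differ in size. A minimal plain-Python sketch of that point follows; the two partitions and their values are hypothetical, and the `merge` function plays the role of the lambda passed to reduceByKey.

```python
from functools import reduce

# Hypothetical values spread over two partitions of unequal size.
partition_a = [10, 20, 30, 40]
partition_b = [100]


def to_pair(value):
    """Turn a value into a (sum, count) pair."""
    return (value, 1)


def merge(x, y):
    """Combine two (sum, count) pairs; associative, so the order is irrelevant."""
    return (x[0] + y[0], x[1] + y[1])


# Each partition is reduced on its own, then the partial results are merged.
partial_a = reduce(merge, map(to_pair, partition_a))
partial_b = reduce(merge, map(to_pair, partition_b))
total, count = merge(partial_a, partial_b)
average = total / count

# Averaging the per-partition averages weights both partitions equally
# and therefore gives a different, wrong result.
naive = (sum(partition_a) / len(partition_a)
         + sum(partition_b) / len(partition_b)) / 2

print(average, naive)
```

The division happens once, after all pairs have been merged, which is what the final map step of the Spark version does.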

26 What can Hadoop give to Spark? YARN, Distributed File System, Disaster Recovery, Data Security

27 What can Spark give to Hadoop? Read from HDFS Write to HDFS Read from HDFS Write to HDFS Read from HDFS

29 DataFrame
1. A distributed collection of rows organized into named columns
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas)
3. Archaic: previously SchemaRDD (cf. Spark < 1.3)

30 DataFrame

31 RDD vs. DataFrame: Compute an Average

Using RDDs

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

Using SQL

    SELECT name, avg(age) FROM people GROUP BY name

Using DataFrames

    sqlCtx.table("people") \
        .groupBy("name") \
        .agg("name", avg("age")) \
        .map(lambda ...) \
        .collect()
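The SQL variant above can be tried without a Spark installation. The sketch below runs the same query against an in-memory SQLite database from Python's standard library. SQLite is swapped in here purely for illustration and is not Spark SQL; the rows in `people` are made up, and an ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

# In-memory database standing in for the "people" table of the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("Ann", 30), ("Ann", 40), ("Bob", 25)],  # hypothetical sample rows
)

# Same query as on the slide, plus ORDER BY for a stable row order.
rows = conn.execute(
    "SELECT name, avg(age) FROM people GROUP BY name ORDER BY name"
).fetchall()
conn.close()

print(rows)
```

The query states only what is wanted, one average per name, and leaves the sum-and-count bookkeeping of the RDD version to the engine.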

33 What else is there? Libraries: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), SparkR (R on Spark). Engine: Spark Core. Cluster managers: YARN, Mesos, Standalone

34 Streaming with Azure Event Hub Azure Event Hub → HDInsight Spark Streaming → Power BI

37 Further Information
Developers:
- News, resources, events and support for developers
- MSDN Flash: free newsletter for developers
IT Pros:
- News, resources, events and support for IT professionals
- TechNet Flash: free newsletter for IT professionals
For Devs and IT Pros:
- Free online training for developers and IT professionals
- Video platform for developers and IT professionals
