Introduction to Apache Spark

1 Introduction to Apache Spark
These training presentations were prepared as part of the Istanbul Big Data Education and Research Center Project, no. TR10/16/YNY/0036, carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Bahçeşehir University bears sole responsibility for the content, which does not reflect the views of İSTKA or the Ministry of Development.

2 A Major Step Backwards?
» MapReduce is a step backward in database access:
    » Schemas are good
    » Separation of the schema from the application is good
    » High-level access languages are good
» MapReduce is a poor implementation
    » Brute force and only brute force (no indexes, for example)
» MapReduce is not novel
» MapReduce is missing features
    » Bulk loader, indexing, updates, transactions
» MapReduce is incompatible with DBMS tools
Source: blog post by DeWitt and Stonebraker

3 Need for High-Level Languages
» MapReduce is great for one-pass, large-data processing
» But writing Java programs for everything is verbose and slow
    » Data scientists don't want to write Java
» And it is inefficient for multi-pass algorithms
    » No efficient primitives for data sharing and iterative tasks
    » State between steps goes to the distributed file system
    » Slow due to replication and disk storage

4 Move it onto a cluster
» In a cluster setting, using MapReduce and Hadoop is slow
» Hadoop writes to disk and is complex
» How can this be improved?

5 Example: Iterative Apps
[Figure: an iterative job reads its input from the file system and writes back to the file system on every iteration (iter. 1, ...); interactive queries (query 1, query 2, query 3, ...) each read the input from the file system again to produce result 1, result 2, result 3, ...]
Commonly spend 90% of time doing I/O

6 What we need is
» Resilient
» Checkpointing
» Fast, does not always store to disk
» Replayable
» Embarrassingly parallel

7 Scala
Scala has many other nice features:
» A type system that makes sense
» Traits
» Implicit conversions
» Pattern matching
» XML literals, parser combinators, ...

8 Spark: A Brief History
» MapReduce at Google
» 2004: MapReduce paper
» 2006: Hadoop at Yahoo!
» Hadoop Summit
» 2010: Spark paper
» Apache Spark becomes a top-level project

9 Why better than Hadoop?
» In-memory as opposed to disk-based
» Data can be cached in memory or on disk for future use
» Fast: up to 100 times faster, because it uses memory rather than disk
» Easier to use than Hadoop, with a functional style; runs a general DAG
» APIs in Java, Scala, Python, R

10 WordCount: MapReduce vs Spark
» WordCount in 50+ lines of Java MapReduce
» WordCount in 3 lines of Spark

11 Word Count Example
    text_file = sc.textFile("hdfs://... wordslist.txt")
    wordcounts = text_file.flatMap(lambda line: line.split(" ")) \
                          .map(lambda word: (word, 1)) \
                          .reduceByKey(lambda x, y: x + y) \
                          .collect()
Result: ('rat', 2), ('elephant', 1), ('cat', 2)
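
The three steps of the word count can be traced without a cluster. The following is a plain-Python sketch, not Spark code: `flat_map` and `reduce_by_key` are hypothetical helpers that imitate what `flatMap` and `reduceByKey` do to a collection, and the input lines are made up to stand in for the contents of wordslist.txt.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to every record and flatten the results (imitates flatMap)."""
    return list(chain.from_iterable(func(r) for r in records))


def reduce_by_key(func, pairs):
    """Merge all values that share a key using func (imitates reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical input lines; the real file contents are not given in the slides.
lines = ["cat rat", "elephant cat", "rat"]

words = flat_map(lambda line: line.split(" "), lines)   # one record per word
pairs = [(word, 1) for word in words]                   # the map step
word_counts = reduce_by_key(lambda x, y: x + y, pairs)  # sum the 1s per word

print(sorted(word_counts.items()))
```

In Spark the same three steps run per partition across the cluster, and `reduceByKey` combines partial sums from different machines; the sketch only shows the data flow.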

12 Disk vs Memory
L1 cache reference:             0.5 ns
L2 cache reference:               7 ns
Mutex lock/unlock:              100 ns
Main memory reference:          100 ns
Disk seek:               10,000,000 ns

13 Network vs Local
Send 2K bytes over 1 Gbps network:            20,000 ns
Read 1 MB sequentially from memory:          250,000 ns
Round trip within same datacenter:           500,000 ns
Read 1 MB sequentially from network:      10,000,000 ns
Read 1 MB sequentially from disk:         30,000,000 ns
Send packet CA->Netherlands->CA:         150,000,000 ns
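
To make the gap concrete, the ratios between these latencies can be worked out directly. The short sketch below uses only the figures from the two latency slides above; the dictionary and the `slowdown` helper are illustrative names, not part of any library.

```python
# Latencies in nanoseconds, copied from the two latency slides above.
LATENCY_NS = {
    "main memory reference": 100,
    "disk seek": 10_000_000,
    "read 1 MB sequentially from memory": 250_000,
    "read 1 MB sequentially from network": 10_000_000,
    "read 1 MB sequentially from disk": 30_000_000,
}


def slowdown(slow, fast):
    """How many times longer the slow operation takes than the fast one."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


disk_vs_memory = slowdown("read 1 MB sequentially from disk",
                          "read 1 MB sequentially from memory")
network_vs_memory = slowdown("read 1 MB sequentially from network",
                             "read 1 MB sequentially from memory")
seek_vs_reference = slowdown("disk seek", "main memory reference")

print(disk_vs_memory, network_vs_memory, seek_vs_reference)
```

By these numbers a sequential 1 MB read is 120 times slower from disk than from memory, and a single disk seek costs as much as 100,000 main memory references, which is the motivation for keeping working data in memory.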

14 Spark Architecture

15 Resilient Distributed Datasets (RDDs)
» RDD: Spark primitive representing a collection of records
    » Immutable
    » Partitioned (the D in RDD)
» Transformations operate on an RDD to create another RDD
    » Coarse-grained manipulations only
    » RDDs keep track of lineage
» Persistence
    » RDDs can be materialized in memory or on disk
    » OOM or machine failures: what happens?
» Fault tolerance (the R in RDD):
    » RDDs can always be recomputed from stable storage (disk)

16 Resilient Distributed Datasets (RDDs)
Resilient:
» Recover from errors, e.g. node failure, slow processes
» Track the history of each partition and re-run it

17 Fault Tolerance
RDDs track lineage info to rebuild lost data:
    file.map(lambda rec: (rec.type, 1)) \
        .reduceByKey(lambda x, y: x + y) \
        .filter(lambda kv: kv[1] > 10)   # kv is a (type, count) pair
[Figure: Input file -> map -> reduce -> filter]
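
The idea of rebuilding data from lineage can be illustrated with a toy class. This is a plain-Python sketch under simplifying assumptions, not Spark's implementation: `MiniRDD` is a hypothetical name, it stores only how to compute its records plus a pointer to its parent, and the input records and the threshold of 2 are made up so the example stays small.

```python
class MiniRDD:
    """Toy stand-in for an RDD: keeps the recipe for its data, not the data."""

    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute  # function that produces this dataset's records
        self.parent = parent     # lineage pointer to the dataset it came from
        self.name = name

    def map(self, func):
        return MiniRDD(lambda: [func(r) for r in self.collect()], self, "map")

    def filter(self, pred):
        return MiniRDD(lambda: [r for r in self.collect() if pred(r)],
                       self, "filter")

    def reduce_by_key(self, func):
        def compute():
            merged = {}
            for key, value in self.collect():
                merged[key] = func(merged[key], value) if key in merged else value
            return list(merged.items())
        return MiniRDD(compute, self, "reduceByKey")

    def collect(self):
        # Nothing is stored, so every call replays the chain from the source.
        return self._compute()

    def lineage(self):
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


# Hypothetical record types standing in for rec.type in the slide.
record_types = ["error", "info", "error", "warn", "error"]
file = MiniRDD(lambda: list(record_types))
frequent = (file.map(lambda t: (t, 1))
                .reduce_by_key(lambda x, y: x + y)
                .filter(lambda kv: kv[1] > 2))

print(frequent.lineage())
print(frequent.collect())
```

Because each dataset knows its parent and its transformation, a lost result can be recomputed by replaying the chain from the input, which is the recovery strategy the slide describes.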

19 Resilient Distributed Datasets (RDDs)
Distributed:
» Distributed across the cluster of machines
» Divided into partitions, atomic chunks of data
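
Partitioning can be pictured as cutting a collection into chunks that are processed independently and then combined. The sketch below is plain Python with local threads standing in for cluster machines; the `partition` and `process_partition` helpers and the choice of three partitions are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(records), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def process_partition(chunk):
    """Work done independently on one partition: sum of squares."""
    return sum(x * x for x in chunk)


data = list(range(10))
partitions = partition(data, 3)

# Each partition is handled by its own worker thread, then results are merged.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)
print(partitions, partial_sums, total)
```

Each chunk is an atomic unit of work: no worker needs to see another worker's partition until the final merge.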

20 Resilient Distributed Datasets (RDDs)
Dataset:
» Data storage created from: HDFS, S3, HBase, JSON, text, local hierarchy of folders
» Or created by transforming another RDD

21 Resilient Distributed Datasets (RDDs)
» Immutable collections of objects, spread across the cluster
» Statically typed: RDD[T] has objects of type T
    val sc = new SparkContext()
    val lines = sc.textFile("log.txt")  // RDD[String]

    // Transform using standard collection operations (lazily evaluated)
    val errors = lines.filter(_.startsWith("error"))
    val messages = errors.map(_.split("\t")(2))

    messages.saveAsTextFile("errors.txt")  // kicks off a computation

22 Spark Driver and Workers
» A Spark program is two programs: a driver program and a workers program
» Worker programs run on cluster nodes or in local threads
» DataFrames are distributed across workers
[Figure: the driver program (SparkContext, sqlContext) talks to a cluster manager or to local threads, which run Spark executors on the workers; the workers read from Amazon S3, HDFS, or other storage]

23 Spark Program
1) Create some input RDDs from external data, or parallelize a collection in your driver program.
2) Lazily transform them to define new RDDs using transformations like filter() or map().
3) Ask Spark to cache() any intermediate RDDs that will need to be reused.
4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
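
The four steps above can be sketched in plain Python to show when work actually happens. `LazyDataset` is a hypothetical toy class, not a Spark API: transformations only build a recipe, `cache()` marks a dataset for reuse, and only the actions `count()` and `collect()` read the source. The log lines are made up for the example.

```python
class LazyDataset:
    """Toy dataset: transformations record work, actions execute it."""

    def __init__(self, compute):
        self._compute = compute       # recipe that produces the records
        self._cache_enabled = False
        self._cached = None

    def _records(self):
        if self._cached is not None:
            return self._cached       # reuse the materialized records
        records = self._compute()
        if self._cache_enabled:
            self._cached = records
        return records

    # Transformations: return a new dataset, run nothing yet.
    def map(self, func):
        return LazyDataset(lambda: [func(r) for r in self._records()])

    def filter(self, pred):
        return LazyDataset(lambda: [r for r in self._records() if pred(r)])

    def cache(self):
        self._cache_enabled = True
        return self

    # Actions: trigger the computation.
    def count(self):
        return len(self._records())

    def collect(self):
        return list(self._records())


reads = []  # grows by one entry each time the source is really read


def load():
    reads.append("read")
    return ["error: disk", "info: ok", "error: net", "warn: slow"]


lines = LazyDataset(load)                                        # step 1
errors = lines.filter(lambda l: l.startswith("error")).cache()   # steps 2 and 3
reads_before_action = len(reads)                                 # still zero

n = errors.count()                                               # step 4
msgs = errors.map(lambda l: l.split(": ")[1]).collect()          # served from cache
print(reads_before_action, n, msgs, len(reads))
```

The source is read once even though two actions run, because the second action starts from the cached intermediate dataset.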

24 Operations on RDDs
Transformations (lazy):
» map, flatMap, filter
» union/intersection, join
» reduceByKey, groupByKey
Actions (actually trigger computations):
» collect
» saveAsTextFile/saveAsSequenceFile

25 Spark with Python

26 Deploying code to the cluster

27 Talking to Cluster Manager
The manager can be:
» YARN
» Mesos
» Spark Standalone

28 RDD -> Stages -> Tasks
Example: rdd1.join(rdd2).groupBy(...).filter(...)
» RDD Objects: build operator DAG
» DAG Scheduler: split graph into stages of tasks; submit each stage as ready
» Task Scheduler: launch tasks via cluster manager; retry failed or straggling tasks
» Worker (threads, block manager): execute tasks; store and serve blocks
[Figure: RDD Objects -> (DAG) -> DAG Scheduler -> (TaskSet) -> Task Scheduler / Cluster manager -> (Task) -> Worker]
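
How a chain of operations is cut into stages can be sketched in a few lines. This is a deliberately simplified plain-Python model, not the logic of Spark's DAG scheduler: it assumes a linear chain of operations and assumes that every operation in the hypothetical `WIDE` set needs a shuffle and therefore starts a new stage.

```python
# Assumption: these operations need data from many partitions (a shuffle).
WIDE = {"join", "groupBy", "reduceByKey", "groupByKey"}


def split_into_stages(operations):
    """Cut a linear chain of operation names at every shuffle boundary."""
    stages, current = [], []
    for op in operations:
        if op in WIDE and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


chain = ["textFile", "map", "join", "groupBy", "filter"]
stages = split_into_stages(chain)
print(stages)
```

Operations inside one stage can be pipelined on a partition without moving data, so a trailing filter shares a stage with the groupBy that precedes it.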

29 Where does code run?
Local or distributed?
» Locally, in the driver
» Distributed at the executors
» Both at the driver and the executors
» Transformations run at executors
» Actions run at executors and driver
Important point:
» Executors run in parallel
» Executors have much more memory
[Figure: your application (driver program) connected to two workers, each running a Spark executor]

30 Spark Components

31 MLlib algorithms
» classification: logistic regression, linear SVM, naïve Bayes, classification tree
» regression: generalized linear models (GLMs), regression tree
» collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
» clustering: k-means
» decomposition: SVD, PCA
» optimization: stochastic gradient descent, L-BFGS

32 Machine Learning Library (MLlib)
    points = context.sql("select latitude, longitude from tweets")
    model = KMeans.train(points, 10)
70+ contributors in past year
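
For readers unfamiliar with what a k-means training call computes, here is a minimal single-machine sketch of the basic iteration (Lloyd's algorithm) in plain Python. It is not MLlib's implementation, which is distributed and initializes centers differently; the naive "first k points" initialization, the helper names, and the coordinates are all assumptions made for the example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Basic Lloyd's algorithm: assign points to centers, then recompute centers."""
    centers = list(points[:k])  # naive initialization: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ended up empty.
        centers = [mean(cluster) if cluster else centers[i]
                   for i, cluster in enumerate(clusters)]
    return centers


# Hypothetical (latitude, longitude) pairs forming two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers = kmeans(points, 2)
print(centers)
```

The second argument plays the same role as the 10 in the slide: the number of clusters to find.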

33 Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs:
» Chop up the live stream into batches of X seconds
» Spark treats each batch of data as RDDs and processes them using RDD operations
» Finally, the processed results of the RDD operations are returned in batches
[Figure: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]

34 Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs:
» Batch sizes as low as ½ second, latency ~ 1 second
» Potential for combining batch processing and streaming processing in the same system
[Figure: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]
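
The micro-batch idea can be sketched without a streaming system: group timestamped events into fixed windows, then run an ordinary batch computation on each window. The code below is plain Python, not the Spark Streaming API; `micro_batches`, the event list, and the half-second window are assumptions for illustration, and windows with no events are simply skipped.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[index] for index in sorted(batches)]


# Hypothetical live stream: (seconds since start, word seen).
events = [(0.1, "a"), (0.4, "b"), (0.6, "a"), (1.2, "c"), (1.3, "a")]

batches = micro_batches(events, 0.5)

# The same batch job (a word count) is applied to every small batch.
results = [dict(Counter(batch)) for batch in batches]
print(batches)
print(results)
```

Because each window is processed by a deterministic batch job, the same code can serve both historical data and the live stream.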

35 MLlib + SQL
    df = context.sql("select latitude, longitude from tweets")
    model = pipeline.fit(df)
» DataFrames in Spark 1.3 (as of March 2015)
» Powerful when coupled with the new pipeline API

36 Spark SQL
    // Run SQL statements
    val teenagers = context.sql(
      "SELECT name FROM people WHERE age >= 13 AND age <= 19")

    // The results of SQL queries are RDDs of Row objects
    val names = teenagers.map(t => "Name: " + t(0)).collect()

37 GraphX

38 GraphX
» General graph processing library
» Build graph using RDDs of nodes and edges
» Run standard algorithms such as PageRank
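
As a reminder of what PageRank computes on a graph of nodes and edges, here is a small single-machine sketch in plain Python. It uses a common textbook formulation in which ranks sum to 1, and it is not GraphX's implementation or API; the edge list, the damping factor of 0.85, and the iteration count are assumptions, and the sketch expects every node to have at least one outgoing edge.

```python
def pagerank(edges, iterations=50, damping=0.85):
    """Iterative PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        contributions = {node: 0.0 for node in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                # Each node splits its rank evenly among its out-links.
                contributions[dst] += ranks[src] / len(targets)
        ranks = {node: (1 - damping) / len(nodes) + damping * contributions[node]
                 for node in nodes}
    return ranks


# Hypothetical graph: every node has at least one outgoing edge.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Node c ends up ranked highest here because it receives links from both a and b.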

39 MLlib + GraphX
