Big Data Infrastructures & Technologies
1 Big Data Infrastructures & Technologies Spark and MLLIB
2 OVERVIEW OF SPARK
3-5 What is Spark? Fast and expressive cluster computing system, interoperable with Apache Hadoop. Improves efficiency through: in-memory computing primitives; general computation graphs. Improves usability through: rich APIs in Scala, Java, Python; interactive shell. Up to 100× faster (2-10× on disk), often 5× less code.
6 The Spark Stack. Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS): Spark SQL, Spark Streaming (real-time), GraphX (graph), MLlib (machine learning). More details: amplab.berkeley.edu
7 Why a New Programming Model? MapReduce greatly simplified big data analysis But as soon as it got popular, users wanted more: More complex, multi-pass analytics (e.g. ML, graph) More interactive ad-hoc queries More real-time stream processing All 3 need faster data sharing across parallel jobs
8-14 Data Sharing in MapReduce [diagram, built up across slides]: each iteration reads its input from HDFS and writes its output back (read, iter. 1, write, read, iter. 2, write, ...); likewise, each of queries 1, 2, 3, ... re-reads the input from disk to produce its result. Slow due to replication, serialization, and disk IO.
15 Data Sharing in Spark [diagram]: iterations share data through distributed memory; the input is read once (one-time processing) and queries 1, 2, 3, ... run against the in-memory data. ~10× faster than network and disk.
16 Spark Programming Model Key idea: resilient distributed datasets (RDDs) Distributed collections of objects that can be cached in memory across the cluster Manipulated through parallel operators Automatically recomputed on failure Programming interface Functional APIs in Scala, Java, Python Interactive use from Scala shell
17-26 Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns. The driver builds up the RDDs [diagram, built up across slides]:

    lines = sc.textFile("hdfs://...")                        # base RDD
    errors = lines.filter(lambda x: x.startswith("ERROR"))   # transformed RDD
    messages = errors.map(lambda x: x.split("\t")[2])
    messages.cache()
27-28 Lambda Functions

    errors = lines.filter(lambda x: x.startswith("ERROR"))
    messages = errors.map(lambda x: x.split("\t")[2])

A lambda function (functional programming!) is an implicit, anonymous function definition; the filter above is equivalent to passing a named predicate:

    bool detect_error(string x) {
        return x.startswith("ERROR");
    }
29-45 Example: Log Mining, continued [diagram, built up across slides]. Calling

    messages.filter(lambda x: "foo" in x).count()

is an action: the driver ships tasks to the workers holding HDFS blocks 1-3; each worker reads its block, applies the transformations, caches its partition of messages (Cache 1-3), and returns results to the driver. A second query,

    messages.filter(lambda x: "bar" in x).count()

reuses the cached data, so only tasks and results travel over the network. Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
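A minimal, self-contained PySpark sketch of the whole pattern above; the HDFS path and the tab-separated log format are placeholder assumptions, not part of the original slides:

    from pyspark import SparkContext

    sc = SparkContext(appName="LogMining")

    # Transformations are lazy: these three lines only record lineage.
    lines = sc.textFile("hdfs://namenode/logs/app.log")      # placeholder path
    errors = lines.filter(lambda x: x.startswith("ERROR"))
    messages = errors.map(lambda x: x.split("\t")[2])        # assumes tab-separated records
    messages.cache()

    # Actions trigger the computation; the second count hits the cache.
    print(messages.filter(lambda x: "foo" in x).count())
    print(messages.filter(lambda x: "bar" in x).count())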
46-48 Fault Tolerance. RDDs track lineage info to rebuild lost data; the lineage here is input file, then map, then reduce, then filter:

    file.map(lambda rec: (rec.type, 1)) \
        .reduceByKey(lambda x, y: x + y) \
        .filter(lambda tc: tc[1] > 10)   # keep types seen more than 10 times
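A runnable toy version of that pipeline, with plain strings standing in for records with a .type field (data and threshold are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="LineageDemo")

    # 30 toy records: 'a' appears 20 times, 'b' 10 times.
    recs = sc.parallelize(["a", "b", "a"] * 10)
    result = (recs.map(lambda rec: (rec, 1))
                  .reduceByKey(lambda x, y: x + y)
                  .filter(lambda tc: tc[1] > 10))
    print(result.collect())   # [('a', 20)] -- 'b' (count 10) is filtered out

If a partition of result is lost, Spark recomputes just that partition by replaying this lineage.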
49 Example: Logistic Regression [chart: running time (s) vs. number of iterations]. Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.
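Why caching dominates here: only the first iteration pays the disk read. Below is a sketch of the classic iterative logistic regression on Spark, written as plain gradient descent rather than MLlib's implementation; the path, record format, and dimensions are assumptions:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="LogisticRegression")
    D, ITERATIONS = 10, 10   # feature count and iteration count (assumed)

    def parse_point(line):
        # Assumed format: "label f1 f2 ... fD" with label in {-1, +1}.
        parts = [float(v) for v in line.split()]
        return parts[0], np.array(parts[1:])

    # Cached once; every later iteration runs against memory.
    points = sc.textFile("hdfs://namenode/data/points.txt").map(parse_point).cache()

    w = np.zeros(D)
    for _ in range(ITERATIONS):
        # One map + reduce per iteration: gradient of the logistic loss.
        gradient = points.map(
            lambda p: p[1] * p[0] * (1.0 / (1.0 + np.exp(p[0] * p[1].dot(w))) - 1.0)
        ).reduce(lambda a, b: a + b)
        w -= gradient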
50 Spark in Scala and Java

    // Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

    // Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
        Boolean call(String s) { return s.contains("ERROR"); }
    }).count();
51 Supported Operators: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
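A quick taste of two of these, reduceByKey and join, on toy in-memory data (names and values are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="OperatorsDemo")

    # Word counts with reduceByKey.
    words = sc.parallelize(["spark", "hadoop", "spark", "mllib"])
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # Join two pair RDDs on their keys.
    meta = sc.parallelize([("spark", "engine"), ("mllib", "library")])
    print(counts.join(meta).collect())
    # e.g. [('spark', (2, 'engine')), ('mllib', (1, 'library'))]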
52 Software Components. The Spark client is a library in the user program (one instance per app); it runs tasks locally or on a cluster (Mesos, YARN, standalone mode); it accesses storage systems via the Hadoop InputFormat API and can use HBase, S3, and other storage. [diagram: your application holds a SparkContext, which talks to a cluster manager or local threads, which run Spark executors]
53 Task Scheduler. General task graphs; automatically pipelines functions; data-locality aware; partitioning aware, to avoid shuffles. [diagram: a DAG over RDDs A-F, with a groupBy in Stage 1, a join feeding Stage 2, and map/filter in Stage 3; cached partitions marked]
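A sketch of the partitioning-awareness point: pre-partitioning both sides of a join with the same partitioner lets the join run without a shuffle (sizes and keys here are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="PartitioningDemo")

    # Both RDDs hashed into the same 8 partitions and cached.
    left = sc.parallelize([(i, i * 2) for i in range(100)]).partitionBy(8).cache()
    right = sc.parallelize([(i, i * 3) for i in range(100)]).partitionBy(8).cache()

    # Matching keys are already co-located, so this join is narrow (no shuffle).
    print(left.join(right).take(3))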
54 Spark SQL. Columnar SQL analytics engine for Spark; supports both SQL and complex analytics. Up to 100× faster than Apache Hive. Compatible with Apache Hive: HiveQL, UDF/UDAF, SerDes, scripts; runs on existing Hive warehouses. In use at Yahoo! for fast in-memory OLAP.
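A small sketch of mixing SQL with Spark, using the Spark 1.x SQLContext API that matches this deck's era; the table name and toy rows are made up for illustration:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="SparkSQLDemo")
    sqlContext = SQLContext(sc)

    # Build a tiny table from an RDD of Rows and register it for SQL.
    rows = sc.parallelize([Row(page="Main", views=100), Row(page="Spark", views=42)])
    sqlContext.createDataFrame(rows).registerTempTable("wikistats")

    top = sqlContext.sql("SELECT page, SUM(views) AS v FROM wikistats "
                         "GROUP BY page ORDER BY v DESC LIMIT 10")
    print(top.collect())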
55 Hive Architecture [diagram]: client (CLI, JDBC) talks to the driver (SQL parser, query optimizer, physical plan, execution), which runs on MapReduce; metadata lives in the Hive catalog.
56 Spark SQL Architecture [diagram]: client (CLI, JDBC) talks to the driver (SQL parser, query optimizer, cache manager, physical plan, execution), which runs on Spark; metadata lives in the Hive catalog. [Engle et al, SIGMOD 2012]
57 What Makes it Faster? Lower-latency engine (Spark is fine with 0.5 s jobs); support for general DAGs; column-oriented storage and compression; new optimizations (e.g. map pruning).
58 Other Spark Stack Projects. Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7):

    sc.twitterStream(...)
      .flatMap(_.getText.split(" "))
      .map(word => (word, 1))
      .reduceByWindow("5s", _ + _)

MLlib: library of high-quality machine learning algorithms (out since 0.8).
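A hedged PySpark Streaming counterpart of the Scala snippet above; a local socket source stands in for Twitter (which needs external credentials), and the 5-second batch interval mirrors the window:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingDemo")
    ssc = StreamingContext(sc, batchDuration=5)   # 5-second batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda l: l.split(" "))
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()   # print each batch's word counts

    ssc.start()
    ssc.awaitTermination()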
59-61 Performance [charts, built up across slides]: SQL — response time (s) for Impala (mem and disk), Redshift, and Spark SQL (mem and disk); Streaming — throughput (MB/s/node) for Storm vs. Spark; Graph — response time (min) for Giraph, GraphX, and Hadoop.
62-65 What it Means for Users [diagram, built up across slides]. Separate frameworks: each stage of an ETL, train, query pipeline reads its input from storage and writes its output back (read, write, read, write, read, write). Spark: one read, then ETL, train, and query all run over the shared in-memory data.
66 Conclusion Big data analytics is evolving to include: More complex analytics (e.g. machine learning) More interactive ad-hoc queries More real-time stream processing Spark is a fast platform that unifies these apps More info: spark-project.org
67 SPARK MLLIB
68 What is MLlib? MLlib is a Spark subproject providing machine learning primitives: initial contribution from AMPLab, UC Berkeley; shipped with Spark since version 0.8.
69 What is MLlib? Algorithms — classification: logistic regression, linear support vector machine (SVM), naive Bayes; regression: generalized linear regression (GLM); collaborative filtering: alternating least squares (ALS); clustering: k-means; decomposition: singular value decomposition (SVD), principal component analysis (PCA). (See the k-means sketch below for a taste.)
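A minimal run of one listed algorithm, k-means, with the MLlib RDD API; the toy in-memory points stand in for a real dataset:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="KMeansDemo")

    points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 9.0], [9.1, 8.9]])
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)        # two centers, near (0,0) and (9,9)
    print(model.predict([0.05, 0.1]))  # cluster id for a new point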
70 Collaborative Filtering [figure]
71 Alternating Least Squares (ALS): factor the user-item rating matrix into low-rank user and item factor matrices, alternately fixing one factor matrix and solving a least-squares problem for the other. [figure]
72 Collaborative Filtering in Spark MLlib

    trainset = sc.textFile("s3n://bads-music-dataset/train_*.gz") \
        .map(lambda l: l.split('\t')) \
        .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

    model = ALS.train(trainset, rank=10, iterations=10)   # train

    # load testing set
    testset = sc.textFile("s3n://bads-music-dataset/test_*.gz") \
        .map(lambda l: l.split('\t')) \
        .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

    # apply model to testing set (only first two cols) to predict
    predictions = model.predictAll(testset.map(lambda p: (p[0], p[1]))) \
        .map(lambda r: ((r[0], r[1]), r[2]))
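A hedged follow-up: scoring those predictions with mean squared error. It assumes the standard MLlib import below and the RDDs defined above; Rating is a namedtuple, so r[0], r[1], r[2] are user, product, rating:

    from pyspark.mllib.recommendation import ALS, Rating

    truth = testset.map(lambda r: ((r[0], r[1]), r[2]))
    mse = (truth.join(predictions)
                .map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
                .mean())
    print("test MSE = %f" % mse)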
73 Spark MLlib ALS Performance. Dataset: Netflix data. Cluster: 9 machines. Systems compared, by wall-clock time (seconds): Matlab, Mahout (4206), GraphLab (291), MLlib (481). MLlib is an order of magnitude faster than Mahout and within a factor of 2 of GraphLab.
74 Spark Implementation of ALS [diagram: master and workers]. Workers load the data; models are instantiated at the workers; at each iteration, models are shared via a join between workers. Good scalability; works on large datasets.
75 Spark SQL + MLLIB
76 MLLIB Pointers Website: Tutorials: Spark Summit: Github: Mailing lists:
More information