2/4/2019 Week 3-A Sangmi Lee Pallickara

CS535 BIG DATA
Colorado State University, Spring 2019
Week 3-A, 2/4/2019
Sangmi Lee Pallickara, Computer Science, Colorado State University

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 1: MAPREDUCE

FAQs
- PA1: Your port ranges are announced in Canvas

Topics of Today's Class
- 3. Distributed Computing Models for Scalable Batch Computing
  - MapReduce II
  - Introduction to Spark
- Quiz 1

MapReduce Data Flow

MapReduce data flow with a single reducer
[Figure: Split 0, Split 1, Split 2 -> copy -> merge -> reduce -> Part 0 -> HDFS replication]

MapReduce data flow with multiple reducers
[Figure: Split 0, Split 1, Split 2 -> merge -> reduce -> Part 0 and Part 1 -> HDFS replication]

How MapReduce Works

Programming components of MapReduce
- Driver
- Mapper
- Reducer
- InputFormat
- Combiner
- Partitioner
- OutputFormat

[Figure: User Program, Master, and workers. Columns: Input files (Split 0 to Split 4), Map phase, Intermediate files (on local disks), Reduce phase, Output files (Output file 0, Output file 1). The numbered steps are listed below.]
1. Shards the input files into M pieces
2. Starts up many copies of the program
3. Assigns work
4. Reads the contents of the corresponding input shard
   - Parses and passes each key-value pair to the Map function
5. Buffered pairs are written to local disk
   - The location is reported to the Master, and the Master forwards it to the reduce worker
6. Accesses the location notified by the Master and performs the reduce function
7. Local write
8. Wakes up the user program

Data locality optimization
- Hadoop tries to run the map task on a node where the input data resides in HDFS
  - Minimizes usage of cluster bandwidth
- If all replication nodes are running other map tasks
  - The job scheduler will look for a free map slot on a node in the same rack

Data movement in tasks

Shuffle
- The process by which the system performs the sort and transfers the map outputs to the reducers as inputs
- MapReduce guarantees that the input to every reducer is sorted by key
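
The shuffle described above can be imitated in plain Scala, without Hadoop. The sketch below is only an illustration under stated assumptions: the names `ShuffleSketch`, `shuffle`, and `reduceMax` and the sample (year, temperature) records are hypothetical, and everything runs in one process rather than across a cluster. The `partitionFor` formula follows the one used by Hadoop's default `HashPartitioner`. Records from all map tasks are routed to a reducer by key, and each reducer then sees its keys in sorted order with the values grouped.

```scala
import scala.collection.immutable.SortedMap

object ShuffleSketch {
  type Record = (Int, Int) // (year, temperature), hypothetical sample data

  // Same formula as Hadoop's default HashPartitioner: clear the sign bit of
  // the key's hash code, then take the remainder by the number of reducers.
  def partitionFor(key: Int, numReducers: Int): Int =
    (key.hashCode & Int.MaxValue) % numReducers

  // Gathers the output of every map task, routes each record to a reducer,
  // and hands each reducer its keys in sorted order with the values grouped.
  def shuffle(mapOutputs: Seq[Seq[Record]], numReducers: Int): Map[Int, SortedMap[Int, Seq[Int]]] =
    mapOutputs.flatten
      .groupBy { case (year, _) => partitionFor(year, numReducers) }
      .map { case (reducer, records) =>
        val grouped = records.groupBy(_._1).map { case (year, rs) => year -> rs.map(_._2) }
        reducer -> SortedMap(grouped.toSeq: _*)
      }

  // The reduce function of this sketch: maximum temperature per year.
  def reduceMax(input: SortedMap[Int, Seq[Int]]): Seq[Record] =
    input.toSeq.map { case (year, temps) => year -> temps.max }

  def main(args: Array[String]): Unit = {
    val mapOutputs = Seq(
      Seq(1950 -> 0, 1949 -> 111, 1950 -> 22), // output of the first map task
      Seq(1952 -> -11, 1949 -> 78)             // output of the second map task
    )
    shuffle(mapOutputs, numReducers = 2).toSeq.sortBy(_._1).foreach {
      case (reducer, input) => println(s"reducer $reducer: ${reduceMax(input)}")
    }
  }
}
```

With two reducers, the even years (1950, 1952) land on reducer 0 in sorted order and 1949 lands on reducer 1, so all values for one key always meet at the same reducer.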

Combiner functions
- Minimize the data transferred between map and reduce tasks
- Users can specify a combiner function
  - To be run on the map output
  - To replace the map output with the combiner output

Combiner example (from the previous max temperature example)

Without combiner
- The first map produced (1950, 0), (1950, 20), (1950, 10)
- The second map produced (1950, 25), (1950, 15)
- The reduce function is called with a list of all the values: (1950, [0, 20, 10, 25, 15])
- Output will be (1950, 25)
- We may express the function as
  max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

With combiner
- The first map produced (1950, 0), (1950, 20), (1950, 10) -> (1950, 20)
- The second map produced (1950, 25), (1950, 15) -> (1950, 25)
- The reduce function is called with a list of all the values: (1950, [20, 25])
- Output will be (1950, 25)

Combiner function
- Runs a local reducer over the map output
- Reduces the amount of data shuffled between the mappers and the reducers
- A combiner cannot replace the reduce function. Why?

Combiner function: Requirements
- The function should be commutative and associative
  - Finding the maximum number (YES)
  - Finding a distribution (YES)
  - Calculating a sum (YES)
  - Finding an average (YES/NO): it is still possible if your combiner also delivers the count of items
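
The requirements above can be checked on the slide's own numbers. The following is a minimal sketch in plain Scala, not Hadoop code; the names `CombinerSketch`, `maxOf`, `sumAndCount`, and `meanOfPartials` are hypothetical. It shows that max can be applied per map task and again at the reducer, that averaging the local averages gives a wrong answer, and that a combiner emitting (sum, count) pairs lets the reducer recover the correct average.

```scala
object CombinerSketch {
  // Values emitted for key 1950 by the two map tasks in the example.
  val firstMap: Seq[Int] = Seq(0, 20, 10)
  val secondMap: Seq[Int] = Seq(25, 15)

  // max is commutative and associative, so the same function can serve as
  // both the combiner (per map task) and the reducer.
  def maxOf(values: Seq[Int]): Int = values.max

  def mean(values: Seq[Double]): Double = values.sum / values.size

  // A combiner for averaging emits (sum, count) instead of a local mean.
  def sumAndCount(values: Seq[Int]): (Int, Int) = (values.sum, values.size)

  // The reducer merges the partial sums and counts, then divides once.
  def meanOfPartials(partials: Seq[(Int, Int)]): Double =
    partials.map(_._1).sum.toDouble / partials.map(_._2).sum

  def main(args: Array[String]): Unit = {
    val withCombiner = maxOf(Seq(maxOf(firstMap), maxOf(secondMap)))
    println(s"max with combiner: $withCombiner")

    // Averaging the local means weights both map tasks equally, which is wrong
    // when they hold different numbers of values.
    val naive = mean(Seq(mean(firstMap.map(_.toDouble)), mean(secondMap.map(_.toDouble))))
    println(s"mean of local means: $naive")

    val correct = meanOfPartials(Seq(sumAndCount(firstMap), sumAndCount(secondMap)))
    println(s"mean from (sum, count) partials: $correct")
  }
}
```

Here the true average of the five values is 14, the average of the two local averages is 15, and the (sum, count) version returns 14. In every case the reducer still runs, because values for the same key arrive from different map tasks and must be merged there.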

YARN Framework

YARN (MapReduce 2)
- To provide scalability to MapReduce
- Splits the responsibilities of the jobtracker
  - Scheduling
  - Task progress monitoring
- MapReduce is one type of YARN application

YARN (MapReduce 2)
- Resource manager
  - Manages the use of resources across the cluster
- Node manager
  - Launches and monitors the compute containers on machines in the cluster
- Application master
  - Manages the lifecycle of applications running on the cluster
  - Negotiates with the resource manager for cluster resources
    - A number of containers, each with a certain memory limit
- Node managers ensure that containers do not use more resources than allocated

A MapReduce job using YARN
[Figure: client node (MapReduce program, Job, client JVM), resource manager node (Resource Manager), node manager nodes (Node Manager with MR AppMaster; Node Manager with YARN child in a task JVM), and HDFS. The step labels are listed below.]
1. Run job
2. Get new application
3. Copy job resources
4. Submit application
5a. Start container
5b. Launch MR AppMaster
6. Initialize job
7. Retrieve input splits
7. Heartbeat (returns task)
8. Allocate resources
9. Start container
9b. Launch YARN child (task JVM)
10. Retrieve job resources
11. Run map task or reduce task

Progress and status updates
- A task reports its progress and status back to its application master
  - Every 3 seconds over the umbilical interface
- The client polls the application master every second
  - mapreduce.client.progressmonitor.pollinterval

CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

In-Memory Cluster Computing: Apache Spark
Introduction

This material is built based on
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," The 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI '12)
- Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, Learning Spark, O'Reilly, 2015
- Spark programming guide

Distributed processing with the Spark framework
- API: Spark
- Job scheduling / cluster computing: Spark standalone, YARN, Mesos
- Storage: HDFS, file system, HBase, Cassandra, etc.

Inefficiencies for emerging applications: (1) Data reuse
- Data reuse is common in many iterative machine learning and graph algorithms
  - e.g. PageRank, K-means clustering, and logistic regression
[Figure: M v0 -> v1, M v1 -> v2, M v2 -> v3, M v3 -> v4]

Inefficiencies for emerging applications: (2) Interactive data analytics
- User runs multiple ad-hoc queries on the same subset of the data

Existing approaches
- Hadoop
  - Writes output to an external stable storage system, e.g. HDFS
  - Substantial overheads due to data replication, disk I/O, and serialization
- Pregel
  - Iterative graph computations
- HaLoop
  - Iterative MapReduce interface
- Pregel and HaLoop support only specific computation patterns
  - e.g. looping a series of MapReduce steps

A unified stack
- Spark contains multiple closely integrated components
- Spark core
  - Computational engine
  - Scheduling, distributing, and monitoring applications
- Spark Streaming
  - Processes live streams of data
- MLlib
  - Machine learning functionality
  - ML algorithms (classification, regression, clustering, and collaborative filtering)
  - Model evaluation
  - Data import
- GraphX
  - Library for manipulating graphs
  - Performs graph-parallel computations
  - Extends the Spark RDD API
- Cluster managers
  - Spark can run over a variety of cluster managers
  - Hadoop YARN, Apache Mesos, and Spark's built-in cluster manager (Standalone scheduler)

Running a simple example

    /* SimpleApp.scala */
    import org.apache.spark.sql.SparkSession

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
        val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
        // cache() keeps the dataset in memory because it is scanned twice below
        val logData = spark.read.textFile(logFile).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println(s"Lines with a: $numAs, Lines with b: $numBs")
        spark.stop()
      }
    }

Scala Tutorial:

sbt build file

    name := "Simple Project"
    version := "1.0"
    scalaVersion := "…" // a 2.11.x release, matching target/scala-2.11 in the build output
    // additional libraries
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

Scala build and run

    # Your directory layout should look like this
    $ find .
    .
    ./build.sbt
    ./src
    ./src/main
    ./src/main/scala
    ./src/main/scala/SimpleApp.scala

    # Package a jar containing your application
    $ sbt package
    ...
    [info] Packaging {..}/{..}/target/scala-2.11/simple-project_….jar

    # Use spark-submit to run your application
    $ YOUR_SPARK_HOME/bin/spark-submit \
      --class "SimpleApp" \
      --master local[4] \
      target/scala-2.11/simple-project_….jar
    ...
    Lines with a: 46, Lines with b: …

In-Memory Cluster Computing: Apache Spark
RDD (Resilient Distributed Dataset)

RDD (Resilient Distributed Dataset)
- Read-only, memory-resident, partitioned collection of records
- A fault-tolerant collection of elements that can be operated on in parallel
- RDDs are the core unit of data in Spark
- Most Spark programming involves performing operations on RDDs

Creating RDDs [1/3]
- Loading an external dataset

      val lines = sc.textFile("/path/to/readme.md")

- Parallelizing a collection in your driver program

      val lines = sc.parallelize(List("pandas", "i like pandas"))

Creating RDDs [2/3]

      1: val lines = sc.textFile("data.txt")
      2: val lineLengths = lines.map(s => s.length)
      3: val totalLength = lineLengths.reduce((a, b) => a + b)

- Line 1: defines a base RDD from an external file
  - This dataset is not loaded in memory
- Line 2: defines lineLengths as the result of a map transformation
  - It is not immediately computed
- Line 3: performs the reduce and computes the result

Creating RDDs [3/3]

      lineLengths.persist()

- Use this if you want to use lineLengths again later

Spark Programming Interface to RDD [1/3]
transformations

      1: val lines = sc.textFile("data.txt")
      2: val lineLengths = lines.map(s => s.length)
      3: val totalLength = lineLengths.reduce((a, b) => a + b)

- Operations that create RDDs
- Return pointers to new RDDs
  - e.g. map, filter, and join
- RDDs can only be created through deterministic operations on either
  - Data in stable storage
  - Other RDDs
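
The "not immediately computed" behavior of lines 2 and 3 can be observed without a cluster. The sketch below is an analogy only: it uses a lazy view from the Scala standard library in place of an RDD, so it says nothing about how Spark schedules work. `LazySketch`, `run`, and the `mapCalls` counter are hypothetical names introduced for illustration. Defining the mapped view runs no map function, and the `reduce` call is what forces the computation.

```scala
object LazySketch {
  // Returns (map calls before the action, map calls after it, total length).
  def run(lines: Seq[String]): (Int, Int, Int) = {
    var mapCalls = 0 // counts how often the map function actually runs

    // Plays the role of a transformation: building the view computes nothing.
    val lineLengths = lines.view.map { s => mapCalls += 1; s.length }
    val callsBeforeAction = mapCalls

    // Plays the role of an action: reduce forces the map and returns a value.
    val totalLength = lineLengths.reduce((a, b) => a + b)

    (callsBeforeAction, mapCalls, totalLength)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, total) = run(Seq("pandas", "i like pandas"))
    println(s"map calls before reduce: $before, after reduce: $after, total length: $total")
  }
}
```

For the two sample strings the map function has run zero times before `reduce` and twice afterwards, and the total length is 19.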

Spark Programming Interface to RDD [2/3]
actions

      1: val lines = sc.textFile("data.txt")
      2: val lineLengths = lines.map(s => s.length)
      3: val totalLength = lineLengths.reduce((a, b) => a + b)

- Operations that return a value to the application or export data to a storage system
  - e.g. count: returns the number of elements in the dataset
  - e.g. collect: returns the elements themselves
  - e.g. save: outputs the dataset to a storage system

Spark Programming Interface to RDD [3/3]
persist

      lineLengths.persist()

- Indicates which RDDs the user wants to reuse in future operations
- Spark keeps persistent RDDs in memory by default
- If there is not enough RAM
  - It can spill them to disk
- Users are allowed to
  - store the RDD only on disk
  - replicate the RDD across machines
  - specify a persistence priority on each RDD

Questions?
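
Why persisting pays off when a dataset is reused can be illustrated with the same kind of standard-library analogy. This is not Spark code, and the names `PersistSketch`, `run`, and `mapCalls` are hypothetical. A lazy view stands in for an unpersisted RDD, and materializing it once with `toVector` stands in for `persist()`. Two "actions" on the lazy view recompute the map each time, while two actions on the materialized copy reuse the stored results.

```scala
object PersistSketch {
  // Returns (map calls for two actions without persisting,
  //          map calls for two actions after materializing once).
  def run(lines: Seq[String]): (Int, Int) = {
    var mapCalls = 0
    val lineLengths = lines.view.map { s => mapCalls += 1; s.length }

    // Two separate traversals of the lazy view: the map runs again each time.
    val total = lineLengths.sum
    val longest = lineLengths.max
    val withoutPersist = mapCalls

    // Materialize once (the stand-in for persist()), then reuse the results.
    mapCalls = 0
    val persisted = lineLengths.toVector
    val totalAgain = persisted.sum
    val longestAgain = persisted.max
    require(total == totalAgain && longest == longestAgain)

    (withoutPersist, mapCalls)
  }

  def main(args: Array[String]): Unit = {
    val (recomputed, reused) = run(Seq("pandas", "i like pandas"))
    println(s"map calls without persisting: $recomputed, after materializing once: $reused")
  }
}
```

With two input lines, the unpersisted version runs the map four times for the two actions, and the materialized version runs it twice in total.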
