PaaS and Hadoop Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University laiping@tju.edu.cn 1
Outline PaaS Hadoop: HDFS and Mapreduce YARN Single-Processor Scheduling Hadoop Scheduling Dominant-Resource Fair Scheduling 2
PaaS Platform as a Service (PaaS) is a computing platform that abstracts the infrastructure, OS, and middleware to drive developer productivity. 3
PaaS Deliver the computing platform as a service Developing applications using programming languages and tools supported by the PaaS provider. Deploying consumer-created applications onto the cloud infrastructure. 4
Core Platform PaaS providers provide a runtime environment for the developer platform. The runtime environment is managed automatically, so consumers can focus on their services: Dynamic provisioning On-demand resource provisioning Load balancing Distribute workload evenly among resources Fault tolerance Continue operating in the presence of failures System monitoring Monitor system status and measure resource usage. 5
PaaS PaaS Vendors 6
Hadoop Distributed File System GFS vs HDFS Distributed Data Processing Mapreduce 7
Motivation: Large Scale Data Processing Many tasks consist of processing lots of data to produce lots of other data. Large-scale data processing: want to use 1000s of CPUs, but don't want the hassle of managing them. Storage devices fail: 1.7% in year 1, up to 8.6% by year 3 (Google, 2007). With 10,000 nodes and 7 disks per node, the year-1 rate alone implies about 1,190 disk failures per year, or 3.3 failures per day. 8
Example: in Astronomy SKA (Square Kilometer Array) Investment: $2 billion Data volume: over 12 TB per second. 9
Motivation Data processing system provides: User-defined functions. Automatic parallelization and distribution. Fault tolerance. I/O scheduling. Status and monitoring. 10
Google Cloud Computing Google's platform: Distributed File System: Google File System (GFS); Parallel Programming Model: MapReduce; Distributed Database: BigTable; Programming Language: Sawzall; Distributed Lock Service: Chubby (built on Paxos); Application Services: Web Search, Gmail, Google Maps, Log Analysis; Datacenter Construction. Key papers: The Google File System (2003); MapReduce: Simplified Data Processing on Large Clusters (2004); Interpreting the Data: Parallel Analysis with Sawzall (2005); Bigtable: A Distributed Storage System for Structured Data (2006); The Chubby lock service for loosely-coupled distributed systems (2006); Failure Trends in a Large Disk Drive Population (2007); white paper: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (2009). 11
Google Cloud Computing Google's technologies and their open-source counterparts: Google File System → Hadoop Distributed File System; MapReduce → Hadoop MapReduce; BigTable → HBase. 12
Motivation: GFS Google needs a file system supporting storing massive data. Buy one (Probably including both software and hardware) Expensive! 13
Motivation: GFS Why not use an existing file system [2003]? Examples: Red Hat GFS, IBM GPFS, Sun Lustre, etc. The problem is different: a different workload Huge files (100s of GB to TB) Data is rarely updated in place Reads and appends are common Must run on commodity hardware. Must be compatible with Google's services. 14
Could we build a file system that runs on commodity hardware? 15
Motivation: GFS Design overview: The system is built from many inexpensive commodity components that often fail. The system stores a modest number of large files. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. The workloads also have many large, sequential writes that append data to files. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. High sustained bandwidth is more important than low latency. 16
Google File System Design of GFS Client: implements the file system API; communicates with the master and chunkservers. Master: a single master maintains all file system metadata. Chunkserver: stores data chunks on local disks as Linux files. 17
Google File System Minimize the master's involvement in all operations. Decouple the flow of data from the flow of control to use the network efficiently. A large chunk size: 64 MB/128 MB Reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information. Because a client is more likely to perform many operations on a large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Reduces the size of the metadata stored on the master. 18
Google File System Master operations: Namespace management and locking Replica placement By default, each chunk is replicated 3 times. Creation, re-replication, rebalancing Garbage collection After a file is deleted, GFS does not immediately reclaim the available physical storage. Stale replica detection The master removes stale replicas in its regular garbage collection. 19
HDFS The Apache Hadoop software library: A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. HDFS (Hadoop Distributed File System): A distributed file system that provides high-throughput access to application data. Written in Java programming language. 20
HDFS NameNode Corresponds to GFS's Master. Secondary NameNode Periodically merges the NameNode's edit log into the fsimage checkpoint (despite the name, not a hot standby). DataNode Corresponds to GFS's chunkserver. 21
Hadoop Hadoop Cluster 22
Heartbeat Each DataNode periodically sends a heartbeat to the NameNode, reporting: 1. "I am alive"; 2. its block report (the table of blocks it stores). 23
HDFS Read RPC 24
HDFS Read Network topology and Hadoop 25
HDFS Write 26
HDFS Write HDFS replica placement is a trade-off among reliability, write bandwidth, and read bandwidth. Default strategy: The first replica goes on the same node as the client (a random node when the client runs outside the cluster). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. 27
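The default placement strategy above can be sketched in a few lines of Python. This is an illustrative model, not HDFS's actual `BlockPlacementPolicyDefault` code; the function name and the `nodes_by_rack` layout are assumptions made for the example, and it assumes the chosen off-rack rack holds at least two nodes.

```python
import random

def place_replicas(client_node, nodes_by_rack, in_cluster=True):
    """Pick 3 datanodes following HDFS's default placement policy (sketch).

    nodes_by_rack: dict mapping rack id -> list of node names.
    Returns 3 distinct nodes: one local (or random), one off-rack,
    and one on the same rack as the second.
    """
    rack_of = {n: r for r, nodes in nodes_by_rack.items() for n in nodes}
    # Replica 1: the client's own node if it is a datanode, else random.
    first = client_node if in_cluster else random.choice(list(rack_of))
    # Replica 2: a random node on a different rack (off-rack).
    other_racks = [r for r in nodes_by_rack if r != rack_of[first]]
    rack2 = random.choice(other_racks)
    second = random.choice(nodes_by_rack[rack2])
    # Replica 3: a different node on the same rack as the second
    # (assumes that rack has at least two nodes).
    third = random.choice([n for n in nodes_by_rack[rack2] if n != second])
    return [first, second, third]
```

With two racks `r1 = [a, b]` and `r2 = [c, d]` and a client on `a`, the first replica lands on `a` and the other two on `c` and `d` in some order.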
Mapreduce Jeff Dean 28
Motivation 29
Motivation Every search touches 200+ CPUs and 200+ TB of data, must respond within 0.1 second, and earns roughly 5¢ in revenue. 30
Motivation Web data sets can be very large: Tens to hundreds of terabytes Cannot mine on a single server Data processing examples: Word Count Google Trends PageRank 31
Motivation Simple problem, difficult to solve: How to solve the problem within bounded time. Divide and Conquer! 32
MapReduce MapReduce: a programming model and an associated implementation for processing and generating large datasets, amenable to a broad variety of real-world tasks. Map: takes an input pair and produces a set of intermediate key/value pairs. Reduce: accepts an intermediate key I and the set of values for that key, and merges these values together to form a possibly smaller set of values. 33
Mapreduce 34
Mapreduce 35
Example: WordCount Input: Page 1: the weather is good Page 2: today is good Page 3: good weather is good 36
Example: WordCount map output: Worker 1: (the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1). 37
Example: WordCount Input of Reduce: Worker 1: (the 1) Worker 2: (is 1), (is 1), (is 1) Worker 3: (weather 1), (weather 1) Worker 4: (today 1) Worker 5: (good 1), (good 1), (good 1), (good 1) 38
Example: WordCount Reduce output: Worker 1: (the 1) Worker 2: (is 3) Worker 3: (weather 2) Worker 4: (today 1) Worker 5: (good 4) 39
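The WordCount example above can be replayed end to end with a small Python sketch of the three MapReduce phases. This is a single-process model of the dataflow (the function names are my own), not a distributed implementation:

```python
from collections import defaultdict

def map_phase(pages):
    """Map: emit an intermediate (word, 1) pair for every word on every page."""
    return [(word, 1) for page in pages for word in page.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: merge each key's list of values -- here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

pages = ["the weather is good", "today is good", "good weather is good"]
counts = reduce_phase(shuffle(map_phase(pages)))
```

Running it on the three pages from the slides reproduces the reduce output shown above: the=1, weather=2, is=3, good=4, today=1.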
Hadoop Mapreduce Programming Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Maps input key/value pairs to a set of intermediate key/value pairs. map() method: Called once for each key/value pair in the input split. Most applications should override this. 43
Hadoop Mapreduce Programming Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output. Applications can use the Context to: report progress; set application-level status messages; update Counters; indicate they are alive; get values stored in the job configuration across the map/reduce phases. 44
Hadoop Mapreduce Programming Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Reduces a set of intermediate values which share a key to a smaller set of values. Reducer has 3 primary phases: Shuffle: Copy the sorted output from each Mapper using HTTP across the network. Sort: Sort Reducer inputs by keys. Reduce reduce() method is called. 45
Hadoop Mapreduce Programming reduce() method: This method is called once for each key. Most applications will define their reduce class by overriding this method. 46
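The Mapper/Reducer/Context structure described on these slides can be modeled in plain Python. This is a toy mirror of the Hadoop API shape (the real classes are Java, and `run_job` stands in for the whole framework), showing how applications override `map()` and `reduce()` while the framework drives shuffle and sort:

```python
from collections import defaultdict

class Context:
    """Collects the (key, value) pairs that map()/reduce() calls emit."""
    def __init__(self):
        self.output = []
    def write(self, key, value):
        self.output.append((key, value))

class Mapper:
    def map(self, key, value, context):      # called once per input pair
        context.write(key, value)            # identity map by default

class Reducer:
    def reduce(self, key, values, context):  # called once per key
        for v in values:
            context.write(key, v)

def run_job(records, mapper, reducer):
    """Toy driver: map every record, shuffle/sort by key, then reduce."""
    map_ctx = Context()
    for key, value in records:
        mapper.map(key, value, map_ctx)
    groups = defaultdict(list)               # shuffle: group values by key
    for key, value in map_ctx.output:
        groups[key].append(value)
    reduce_ctx = Context()
    for key in sorted(groups):               # sort reducer inputs by key
        reducer.reduce(key, groups[key], reduce_ctx)
    return reduce_ctx.output

class TokenMapper(Mapper):                   # application overrides map()
    def map(self, key, line, context):
        for word in line.split():
            context.write(word, 1)

class SumReducer(Reducer):                   # application overrides reduce()
    def reduce(self, word, counts, context):
        context.write(word, sum(counts))
```

As in Hadoop, the application supplies only the two overridden methods; everything between them (grouping, sorting, iteration over keys) belongs to the framework.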
Apache Hadoop The project include these modules: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop Distributed File System (HDFS) Hadoop MapReduce Other Hadoop-related projects at Apache include: Ambari Avro Cassandra (A scalable database) HBase (A distributed database) Hive (data summarization and ad hoc querying) Pig (data-flow language) Spark (A fast and general compute engine) Tez (execute an arbitrary DAG of tasks) Chukwa Zookeeper (coordination service) 47
What is YARN? Yet Another Resource Negotiator. Provides resource management services Scheduling Monitoring Control Replaces the resource management services of the JobTracker. Bundled with Hadoop 0.23 and Hadoop 2.x. 48
Why YARN? 49
Why YARN? The Hadoop JobTracker was a barrier to scaling: It is the primary reason Hadoop 1.x is recommended only for clusters of up to 4,000 nodes. With thousands of applications each running tens of thousands of tasks, the JobTracker could not schedule resources as fast as they became available. Distinct map and reduce slots led to artificial bottlenecks and low cluster utilization. 50
Why YARN? MapReduce was being abused by other application frameworks: frameworks tried to work around sort and shuffle, and iterative algorithms were suboptimal. YARN strives to be application-framework agnostic. Different application types can share the same cluster. Runs MapReduce out of the box as part of Apache Hadoop. 51
YARN High-Level Architecture ResourceManager Single, centralized daemon for scheduling containers. Monitors nodes and applications. NodeManager Daemon running on each worker node in the cluster. Launches, monitors, and controls containers. ApplicationMaster Provides scheduling, monitoring, and control for an application instance. The RM launches an AM for each application submitted to the cluster. The AM requests containers via the RM and launches containers via the NMs. Containers Unit of allocation and control in YARN. The ApplicationMaster and application-specific tasks run within containers. 52
YARN High-Level Architecture 53
Mapreduce on YARN 54
Mapreduce on YARN 55
Mapreduce on YARN 56
Mapreduce on YARN 57
Mapreduce on YARN 58
Mapreduce on YARN 59
Mapreduce on YARN 60
Scheduling 61
Why Scheduling? Multiple tasks to schedule The processes on a single-core OS. The tasks of a Hadoop job. The tasks of multiple Hadoop jobs. Limited resources that these tasks require Processor(s) Memory (Less contentious) disk, network Scheduling goals 1. Good throughput or response time for tasks (or jobs) 2. High utilization of resources. 62
Single Processor Scheduling 63
FIFO Scheduling/FCFS Maintain tasks in a queue in order of arrival. When processor free, dequeue head and schedule it. 64
FIFO/FCFS Performance Average completion time may be high. For the example on the previous slides (three tasks with run times 10, 5, and 3, arriving in that order), the tasks complete at times 10, 15, and 18, so: Average completion time of FIFO/FCFS = (10+15+18)/3 = 43/3 = 14.33 65
STF Scheduling (Shortest Task First) Maintain all tasks in a queue, in increasing order of running time. When processor free, dequeue head and schedule. 66
STF Is Optimal! STF yields the shortest average completion time of any non-preemptive schedule! For the same example, the tasks complete at times 3, 8, and 18: Average completion time of STF = (3+8+18)/3 = 29/3 = 9.66 (versus 14.33 for FIFO/FCFS). In general, STF is a special case of priority scheduling: instead of using run time as the priority, the scheduler could use a user-provided priority. 67
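The FIFO-versus-STF comparison above is easy to check by simulation. The sketch below assumes the example's three tasks have run times 10, 5, and 3 (inferred from the arithmetic on the slides) and that tasks run back to back on one processor:

```python
def avg_completion(run_times):
    """Average completion time when tasks run to completion, in order,
    on a single processor."""
    clock, total = 0, 0
    for t in run_times:
        clock += t          # this task finishes at the current clock + t
        total += clock
    return total / len(run_times)

tasks = [10, 5, 3]                    # arrival order from the example
fifo = avg_completion(tasks)          # completions 10, 15, 18 -> 14.33
stf = avg_completion(sorted(tasks))   # completions  3,  8, 18 ->  9.66
```

Sorting by run time moves short tasks ahead of long ones, which lowers every later task's completion time; this is exactly why STF minimizes the average.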
Round-Robin Scheduling Use a quantum (say 1 time unit) to run a portion of the task at the queue head. Pre-empt the task by saving its state and resuming it later. After pre-empting, add it to the end of the queue. 68
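Round-robin is also simple to simulate. The sketch below implements the rule on this slide (run one quantum, pre-empt, re-queue at the back) and reports each task's completion time; the function name and list-based bookkeeping are my own choices:

```python
from collections import deque

def round_robin(run_times, quantum=1):
    """Completion time of each task under round-robin with the given quantum."""
    queue = deque(enumerate(run_times))      # (task id, remaining time)
    clock = 0
    finish = [0] * len(run_times)
    while queue:
        tid, remaining = queue.popleft()     # run the task at the queue head
        step = min(quantum, remaining)
        clock += step
        remaining -= step
        if remaining:
            queue.append((tid, remaining))   # pre-empt: back of the queue
        else:
            finish[tid] = clock
    return finish
```

On the example tasks [10, 5, 3] the completion times come out as [18, 13, 9]: the short task still waits for shares of the long ones, but every task makes steady progress, which is the interactive-responsiveness property the next slide contrasts with FIFO/STF.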
Round-Robin vs. STF/FIFO Round-Robin preferable for: Interactive applications. User needs quick responses from system. FIFO/STF preferable for Batch applications User submits jobs, goes away, comes back to get result. 69
Hadoop Scheduling Activities: mappers and reducers Resources: TaskTracker slots Scheduling goal: time efficiency Scheduler: JobTracker (MRv1) / ResourceManager (YARN) Default scheduling algorithm: FIFO 70
FIFO in Hadoop Supports 5 priority levels. Tasks are sorted by priority, then by submission time. Step 1: Select from the list of tasks with the highest priority. Step 2: From that list, select the task with the earliest submission time. Assign the selected task to a TaskTracker near its target data. 71
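The two selection steps above amount to sorting on a (priority, submission time) key. A minimal sketch, assuming jobs are plain dicts (the field names are illustrative; the five level names match Hadoop's JobPriority enum):

```python
# The five priority levels, ordered from highest (0) to lowest (4).
PRIORITY = {"VERY_HIGH": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3, "VERY_LOW": 4}

def next_job(jobs):
    """Pick the highest-priority job, breaking ties by earliest submission."""
    return min(jobs, key=lambda j: (PRIORITY[j["priority"]], j["submitted"]))
```

For example, given a NORMAL job submitted at t=1 and two HIGH jobs submitted at t=5 and t=2, the scheduler picks the HIGH job from t=2.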
FIFO in Hadoop Improve data locality to reduce communication. Locality levels, from best to worst: same node, same rack, remote rack. 72
FIFO in Hadoop A later-submitted short task may have to wait a very long time if a previously submitted task is quite time-consuming. (The job queue: jobs from Users 1-4, in arrival order.) 73
Hadoop Fair Scheduler Job Scheduling for Multi-User MapReduce Clusters (2009). "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," M. Zaharia et al., EuroSys 2010. 74
Hadoop Fair Scheduler Design goals: Isolation Give each user (job) the illusion of owning (running) a private cluster. Statistical Multiplexing Redistribute capacity unused by some users (jobs) to other users (jobs). 75
Hadoop Fair Scheduler A two-level hierarchy. At the top level, FAIR allocates task slots across pools; each pool receives at least its minimum share. At the second level, each pool allocates its slots among the jobs in the pool. Example: Pools 1 and 3 have minimum shares of 60 and 10 slots, respectively. Because Pool 3 is not using its share, its slots are given to Pool 2. Each user can choose the pool's internal scheduling algorithm (FIFO or Fair). 76
Hadoop Fair Scheduler Notation: d = the demand; m = the minimum share. 77
Hadoop Fair Scheduler FAIR operates in three phases. Phase 1: It fills each unmarked bucket, i.e., it satisfies the demand of each bucket whose minimum share is larger than its demand. Phase 2: It fills all remaining buckets up to their marks. With this step, the isolation property is enforced as each bucket has received either its minimum share, or its demand has been satisfied. Phase 3: FAIR implements statistical multiplexing by pouring the remaining water evenly into unfilled buckets, starting with the bucket with the least water and continuing until all buckets are full or the water runs out. 78
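The three phases above are a water-filling computation. A minimal sketch, assuming each pool is a (demand d, minimum share m) pair and that the cluster has at least sum(min(d, m)) slots; for simplicity phase 3 pours one slot at a time into the least-filled needy pool, which converges to the same even water level:

```python
def fair_allocate(pools, total_slots):
    """Water-filling allocation. pools: list of (demand d, minimum share m).

    Phases 1-2: every pool gets min(d, m), enforcing isolation
    (its demand is met, or it receives its full minimum share).
    Phase 3: leftover slots are poured evenly into pools still below
    their demand, lowest allocation first (statistical multiplexing).
    """
    alloc = [min(d, m) for d, m in pools]
    free = total_slots - sum(alloc)
    while free > 0:
        needy = [i for i, (d, m) in enumerate(pools) if alloc[i] < d]
        if not needy:
            break                       # all demands satisfied; slots idle
        i = min(needy, key=lambda i: alloc[i])   # raise the water level
        alloc[i] += 1
        free -= 1
    return alloc
```

For example, with three pools (d=10, m=4), (d=10, m=4), (d=2, m=4) and 12 slots: phases 1-2 give [4, 4, 2] (pool 3's demand is below its minimum share, so it only takes 2), and phase 3 splits the 2 leftover slots between the first two pools, yielding [5, 5, 2].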
Hadoop Fair Scheduler FAIR uses two timeouts: one for guaranteeing the minimum share (Tmin), and one for guaranteeing the fair share (Tfair), with Tmin < Tfair. If a newly started job does not get its minimum share before Tmin expires, FAIR kills other pools' tasks and re-allocates the slots to the job. Then, if the job has not achieved its fair share by Tfair, FAIR kills more tasks. It picks the most recently launched tasks of over-scheduled jobs, to minimize wasted computation. 79
Estimating Task Lengths HCS/HFS use FIFO within a queue/pool May not be optimal (as we know!) Why not use shortest-task-first instead? It's optimal (as we know!) Challenge: it is hard to know a task's expected running time before it completes. Solution: estimate task lengths. Some approaches: Within a job: estimate a task's running time as proportional to the size of its input. Across tasks: estimate a task's running time in a given job as the average of that job's other tasks (weighted by input size). Lots of recent research results in this area! 80
Dominant-Resource Fair Scheduling Ali Ghodsi, Matei Zaharia, et al., Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. NSDI 2011 81
Challenge What about scheduling VMs in a cloud (cluster)? Jobs may have multi-resource requirements: Job 1 s tasks: 2 CPUs, 8 GB Job 2 s tasks: 6 CPUs, 2 GB How do you schedule these jobs in a fair manner? That is, how many tasks of each job do you allow the system to run concurrently? What does fairness even mean? 82
Dominant Resource Fairness (DRF) Proposed by researchers from U. California Berkeley. Proposes notion of fairness across jobs with multiresource requirements. They showed that DRF is: Fair for multi-tenant systems. Strategy-proof: tenant can t benefit by lying. Envy-free: tenant can t envy another tenant s allocations. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. NSDI 2011 83
Where is DRF Useful? DRF is usable for scheduling VMs in a cluster and for scheduling Hadoop in a cluster. DRF is used in Mesos, an OS intended for cloud environments. DRF-like strategies are also used in some cloud computing companies' distributed operating systems. 84
How DRF Works? Our example: Job 1 s tasks: 2 CPUs, 8 GB => Job 1 s resource vector = <2 CPUs, 8 GB> Job 2 s tasks: 6 CPUs, 2 GB => Job 2 s resource vector = <6 CPUs, 2 GB> Consider a cloud with <18 CPUs, 36 GB RAM>. 85
How DRF Works? Our example: Job 1's tasks: 2 CPUs, 8 GB => Job 1's resource vector = <2 CPUs, 8 GB> Job 2's tasks: 6 CPUs, 2 GB => Job 2's resource vector = <6 CPUs, 2 GB> Consider a cloud with <18 CPUs, 36 GB RAM>. Each Job 1 task consumes 2/18 = 1/9 of the total CPUs and 8/36 = 2/9 of the total RAM. 1/9 < 2/9 => Job 1's dominant resource is RAM, i.e., Job 1 is more memory-intensive than it is CPU-intensive. 86
How DRF Works? Our example: Job 1's tasks: 2 CPUs, 8 GB => Job 1's resource vector = <2 CPUs, 8 GB> Job 2's tasks: 6 CPUs, 2 GB => Job 2's resource vector = <6 CPUs, 2 GB> Consider a cloud with <18 CPUs, 36 GB RAM>. Each Job 2 task consumes 6/18 = 1/3 of the total CPUs and 2/36 = 1/18 of the total RAM. 1/3 > 1/18 => Job 2's dominant resource is CPU, i.e., Job 2 is more CPU-intensive than it is memory-intensive. 87
DRF Fairness For a given job, the % of its dominant resource type that it gets cluster-wide, is the same for all jobs: Job1 s % of RAM = Job2 s % of CPU Can be written as linear equations, and solved. 88
DRF Solution DRF Ensures Job1 s % of RAM = Job2 s % of CPU Solution for our example: Job 1 gets 3 tasks each with <2 CPUs, 8 GB> Job 2 gets 2 tasks each with <6 CPUs, 2 GB> Job1 s % of RAM: = Number of tasks * RAM per task / Total cluster RAM = 3*8/36 = 2/3 Job2 s % of CPU: = Number of tasks * CPU per task / Total cluster CPUs = 2*6/18 = 2/3 89
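The allocation on this slide can be reproduced by progressive filling, the algorithm the DRF paper uses: repeatedly grant one task to the job with the smallest dominant share until no job's next task fits. A minimal sketch (function name and dict layout are my own):

```python
def drf_allocate(capacity, demands):
    """Progressive filling for DRF.

    capacity: total cluster resources, e.g. {"cpu": 18, "ram": 36}.
    demands: per-job dict of per-task requirements.
    Returns the number of tasks granted to each job.
    """
    used = {r: 0 for r in capacity}
    tasks = [0] * len(demands)

    def dominant_share(j):
        # A job's dominant share: the largest fraction it holds of any resource.
        return max(tasks[j] * demands[j][r] / capacity[r] for r in capacity)

    while True:
        # Jobs whose next task still fits in the remaining capacity.
        fits = [j for j, d in enumerate(demands)
                if all(used[r] + d[r] <= capacity[r] for r in capacity)]
        if not fits:
            return tasks
        j = min(fits, key=dominant_share)   # feed the most-starved job
        tasks[j] += 1
        for r in capacity:
            used[r] += demands[j][r]
```

On the slide's example (18 CPUs, 36 GB; Job 1 needs <2 CPUs, 8 GB> per task, Job 2 needs <6 CPUs, 2 GB>), this grants 3 tasks to Job 1 and 2 to Job 2, equalizing both dominant shares at 2/3 as shown above.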
Other DRF Details DRF generalizes to multiple jobs. DRF also generalizes to more than 2 resource types CPU, RAM, Network, Disk, etc. DRF ensures that each job gets a fair share of that type of resource which the job desires the most. Hence fairness. 90
Summary Scheduling is a very important problem in cloud computing: limited resources, and lots of jobs requiring access to those resources. Single-processor scheduling: FIFO/FCFS, STF, priority, round-robin. Hadoop scheduling: FIFO scheduler, Fair scheduler. Dominant-Resource Fairness. 91