Cloud Computing Application I: Virtualization, Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University

Size: px

Start display at page:

Download "Cloud Computing Application I: Virtualization, Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University"

Annabel Moody
5 years ago
Views:

1 Cloud Computing Application I: Virtualization, Hadoop Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University laiping@tju.edu.cn 1

2 Outline Virtualization GFS/HDFS Mapreduce: Part I 2

3 IaaS 3

4 Service Model Software as a Service (SaaS) Use provider's applications over a network SalesForce.com, Gmail etc. Platform as a Service (PaaS) Deploy customer-created applications to a cloud AppEngine, Hadoop, Azure etc. Infrastructure as a Service (IaaS) Rent processing, storage, network Capacity and other fundamental computing resources Amazon EC2, S3 etc. 4

5 Service Model IaaS PaaS- SaaS Use So<ware Applica8on use Use Responsibilityboundary Applica8on-run8me-environment Computer-resource- e.g.-machine,-storage Hardware Policy Policy Policy Parts-related-with-the-cloud-service-provider Parts-related-with-the-cloud-user 5

6 Service Model 6

IaaS IaaS (Infrastructure as a Service): The

consumer is able to deploy and run arbitrary

7 IaaS IaaS (Infrastructure as a Service): The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. Example: 7

IaaS cloud provider guarantees qualified infrastructure services.

8 IaaS With IaaS: IaaS cloud provider takes care of all the IT infrastructure complexities. IaaS cloud provider provides all the infrastructure functionalities. IaaS cloud provider guarantees qualified infrastructure services. IaaS cloud provider charges clients according to their resource usage. Enabling techniques: Virtualization. 8

9 Virtualization 9

10 Virtualization Virtualization is an enabling technique to provide an abstraction of logical resources away from underlying physical resources. 10

11 Virtualization What is virtualization? Virtualization is the creation of a virtual (rather than physical) version of something, such as an operating system, a server, a storage device or network resources. It hides the physical characteristics of a resource from users, instead showing another abstract resource. But, where does virtualization come from? Virtualization is NOT a new idea of computer science. Virtualization concept comes from the component abstraction of system design, and it has been adapted in many system level. Now, let s take a look of our original system architecture!! 11

Layers of Abstraction System abstraction: Computer systems are built on levels of abstraction. Higher level of abstraction hide details at lower levels.

12 Layers of Abstraction System abstraction: Computer systems are built on levels of abstraction. Higher level of abstraction hide details at lower levels. Designer of each abstraction level make use of the functions supported from its lower level, and provide another abstraction to its higher one. Example Files are an abstraction of a disk. 12

13 Layers of Abstraction Machine level abstraction : For OS developers, a machine is defined by ISA (Instruction Set Architecture). This is the major division between hardware and software. Examples : CISC X86 RISC ARM RISC MIPS 13

14 Layers of Abstraction OS level abstraction : For compiler or library developers, a machine is defined by ABI (Application Binary Interface). This defines the basic OS interface which may be used by libraries or user. Examples : User ISA OS system call 14

15 Layers of Abstraction Library level abstraction : For application developers, a machine is defined by API (Application Programming Interface). This abstraction provides the well-rounded functionalities. Examples : User ISA Standard C library Graphical library 15

Layers of Abstraction The concept of virtualization is everywhere!! In IaaS, we focus the virtualization granularity at each physical hardware device.

16 Layers of Abstraction The concept of virtualization is everywhere!! In IaaS, we focus the virtualization granularity at each physical hardware device. General virtualization implementation level : Virtualized instance Software virtualized hardware instance Virtualization layer Software virtualization implementation Abstraction layer Various types of hardware access interface Physical hardware Various types of infrastructure resources Different physical resources : Server, Storage and Network. 16

Virtual Machine What is Virtual Machine (VM)? VM is a software implementation of a machine (i.e. a computer) that executes programs like a real machine.

17 Virtual Machine What is Virtual Machine (VM)? VM is a software implementation of a machine (i.e. a computer) that executes programs like a real machine. Terminology : Host (Target) The primary environment where will be the target of virtualization. Guest (Source) The virtualized environment where will be the source of virtualization. Virtualization involves the construction of an isomorphism ( 同形 ) from guest state to host state. 17

18 System Virtual Machine System virtual machine Provide the entire operating system on same or different host ISA (Instruction Set Architecture). Constructed at ISA level. Examples: XEN, KVM, VMWare. 18

19 Virtual Machine Monitor (VMM) What s Virtual Machine Monitor (VMM)? VMM or Hypervisor is the software layer providing the virtualization. System architecture: 19

20 Virtualization Types Virtualization Types : Type 1 Bare metal ( 原生虚拟化 ) VMMs run directly on the host's hardware as a hardware control and guest operating system monitor. Type 2 Hosted ( 寄宿虚拟化 ) VMMs are software applications running within a conventional operating system. 20

21 Virtualization Approaches Virtualization Approaches : Full-Virtualization ( 完全虚拟化 ) VMM simulates enough hardware to allow an unmodified guest OS. Para-Virtualization ( 部分虚拟化 ) VMM does not necessarily simulate hardware, but instead offers a special API that can only be used by the modified guest OS. 21

22 Virtualization Approaches Full-Virtualization 22

23 Virtualization Approaches Para-Virtualization ( 部分虚拟化 ) 23

24 Examples 24

25 CPU virtualization 25

26 Virtualization Technique Virtualization requirements from Popek and Goldberg : Popek and Goldberg provide a set of sufficient conditions for a computer architecture to efficiently support system virtualization. Popek and Goldberg provide guidelines for the design of virtualized computer architectures. In Popek and Goldberg terminology, a VMM must present all three properties : Equivalence (Same as real machine) 同质 Resource control (Totally control) 资源受控 Efficiency (Native execution) 高效 26

27 CPU Architecture Modern CPU status is usually classified as several modes. In general, we conceptually divide them into two modes: Kernel mode (Ring 0) CPU may perform any operation allowed by its architecture, including any instruction execution, IO operation, area of memory access, and so on. Traditional OS kernel runs in Ring 0 mode. User mode (Ring 1 ~ 3) CPU can typically only execute a subset of those available instructions in kernel mode. Traditional application runs in Ring 3 mode. 27

28 CPU Architecture By the classification of CPU modes, we divide instructions into following types : Privileged instruction ( 特权指令 ) Those instructions that trap if the machine is in user mode and do not trap if the machine is in kernel mode. Sensitive instructions ( 敏敏感指令 ) Those instructions that interact with hardware, which include controlsensitive and behavior-sensitive instructions. Innocuous instruction ( 无害指令 ) All other instructions. Critical instruction ( 临界指令 ) Those sensitive but not privileged instructions. 28

29 CPU Architecture X86 ISA RISC ISA 29

30 CPU Architecture What is trap ( 陷入 )? When CPU is running in user mode, some internal or external events, which need to be handled in kernel mode, take place. Then CPU will jump to hardware exception handler vector, and execute system operations in kernel mode. Trap types : System Call ( 系统调用 ) Invoked by application in user mode. For example, application ask OS for system IO. Hardware Interrupts ( 硬件中断 ) Invoked by some hardware events in any mode. For example, hardware clock timer trigger event. Exception ( 异常 ) Invoked when unexpected error or system malfunction occur. For example, execute privilege instructions in user mode. 30

31 Trap and Emulate Model Traditional OS : When application invoke a system call : CPU will trap to interrupt handler vector in OS. CPU will switch to kernel mode (Ring 0) and execute OS instructions. When hardware event : Hardware will interrupt CPU execution, and jump to interrupt handler in OS. 31

32 Trap and Emulate Model If we want CPU virtualization to be efficient, how should we implement the VMM? We should make guest binaries run on CPU as fast as possible. Theoretically speaking, if we can run all guest binaries natively, there will NO overhead at all. But we cannot let guest OS handle everything, VMM should be able to control all hardware resources. Solution : Ring Compression Shift traditional OS from kernel mode(ring 0) to user mode(ring 1), and run VMM in kernel mode. Then VMM will be able to intercept all trapping event. 32

33 Trap and Emulate Model VMM virtualization paradigm (trap and emulate) : Let normal instructions of guest OS run directly on processor in user mode. When executing privileged instructions, hardware will make processor trap into the VMM. The VMM emulates the effect of the privileged instructions for the guest OS and return to guest. 33

34 Trap and Emulate Model VMM and Guest OS : System Call CPU will trap to interrupt handler vector of VMM. VMM jump back into guest OS. Hardware Interrupt Hardware make CPU trap to interrupt handler of VMM. VMM jump to corresponding interrupt handler of guest OS. Privilege Instruction Running privilege instructions in guest OS will be trapped to VMM for instruction emulation. After emulation, VMM jump back to guest OS. 34

35 Virtualization Theorem Subset theorem : For any conventional third-generation computer, a VMM may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions. Recursive Emulation: A conventional third-generation computer is recursively virtualizable if It is virtualizable. VMM without any timing dependencies can be constructed for it. Under this theorem, x86 architecture cannot be virtualized directly. Other techniques are needed. 35

36 Virtualization Techniques How to virtualize unvirtualizable hardware : Para-virtualization ( 部分虚拟化 ) Modify guest OS to skip the critical instructions. Implement some hyper-calls to trap guest OS to VMM. Binary translation ( 二进制翻译 ) Use emulation technique to make hardware virtualizable. Skip the critical instructions by means of these translations. Hardware assistance Modify or enhance ISA of hardware to provide virtualizable architecture. Reduce the complexity of VMM implementation. Examples: Intel VT-x, AMD-V. 36

37 Hardware Assistance Intel VT-X: Include one more mode in X86 Root mode: Normal Instructions can run like the original mode; VMM runs in this mode Non-Root mode: All sensitive instructions are redefined; Sensitive instructions are trapped into Rood mode; Guest OS runs in this mode. 37

38 Hardware Assistance Intel VT-X: System call: Trapped into Guest OS. Hardware Interrupt: Handled by VMM. Sensitive instructions: Trapped into Non-root mode. 38

39 Memory Virtualization Memory management in OS Traditionally, OS fully controls all physical memory space and provide a continuous addressing space to each process. In server virtualization, VMM should make all virtual machines share the physical memory space without knowing the fact. Goals of memory virtualization: Address Translation Control table-walking hardware that accesses translation tables in main memory. Memory Protection Define access permission which uses the Access Control Hardware. Access Attribute Define attribute and type of memory region to direct how memory operation to be handled. 39

40 Memory Virtualization Memory virtualization architecture: 40

41 Memory Virtualization The performance drop of memory access is usually unbearable. VMM needs further optimization. VMM maintains shadow page tables: Direct virtual-to-physical address mapping. Use hardware Translation Lookaside Buffer (TLB) for address translation. 41

42 Shadow Page Table Map guest virtual address to host physical address Shadow page table Guest OS will maintain its own virtual memory page table in the guest physical memory frames. For each guest physical memory frame, VMM should map it to host physical memory frame. Shadow page table maintains the mapping from guest virtual address to host physical address. Page table protection VMM will apply write protection to all the physical frames of guest page tables, which lead the guest page table write exception and trap to VMM. 42

43 Shadow Page Table Construct shadow page table: Guest OS will maintain its own page table for each process. VMM maps each guest physical page to host physical page. Create shadow page tables for each guest page table. VMM should protect host frame which contains guest page table. 43

44 Other Issues Page fault and page protection issue When a physical page fault occurs, VMM need to decide whether this exception should be injected to guest OS or not. If the page entry in the page table of guest OS is still valid, VMM should prepare the corresponding page and not inject any exception to guest OS. If the page entry in the page table of guest OS is invalid either, then VMM should directly inject the virtual page fault to guest OS. When guest OS want to modify its page tables, VMM need to intercept this operation. When guest OS reload page table, CPU will trap to VMM due to the Ring Compression nature. VMM will walk the page table of guest OS and modify the related shadow page table to make MMU get host physical address. 44

45 Other Virtualization Server Virtualization CPU Virtualization Memory Virtualization I/O Virtualization Storage Virtualization Network Virtualization 45

46 PaaS 46

47 IaaS is not Enough IaaS provides many virtual or physical machines, but it cannot alter the quantity automatically. Consumers might: Require automatic make-decisions of dispatching jobs to available resources. Need a running environment or a development and testing platform to design their applications or services. 47

48 More Requirements Consumers require more and more... Large-scale resource abstraction and management Requirement of large-scale resources on demand Running and hosting environment Automatic and autonomous mechanism Distribution and management of jobs Access control and authentication... 48

49 PaaS Platform as a Service (PaaS) is a computing platform that abstracts the infrastructure, OS, and middleware to drive developer productivity. 49

50 PaaS Deliver the computing platform as a service Developing applications using programming languages and tools supported by the PaaS provider. Deploying consumer-created applications onto the cloud infrastructure. 50

51 Core Platform PaaS providers can provide a runtime environment for the developer platform. Runtime environment is automatic control such that consumers can focus on their services Dynamic provisioning On-demand resource provisioning Load balancing Distribute workload evenly among resources Fault tolerance Continuously operating in the presence of failures System monitoring Monitor the system status and measure the usage of resources. 51

52 PaaS PaaS Venders 52

53 Hadoop Distributed File System GFS vs HDFS Distributed Data Processing Mapreduce 53

Motivation: Large Scale Data Processing Many tasks composed of processing lots of data to produce lots of other data. Large-Scale Data Processing Want to use 1000s of CPUs.

54 Motivation: Large Scale Data Processing Many tasks composed of processing lots of data to produce lots of other data. Large-Scale Data Processing Want to use 1000s of CPUs. But don t want hassle of managing things. Storage devices fail 1.7% year 1-8.6% year 3(Google, 2007) 10,000 nodes, 7disks per node, year1,1190 failures/yr or 3.3failures/day. 54

55 Example: in Astronomy SKA Square Kilometer Array ( 平方公里里阵列列望远镜 ) Investment: $2 billion Data volume: over 12TB per second. 55

56 Motivation Data processing system provides: User-defined functions. Automatic parallelization and distribution. Fault tolerance. I/O scheduling. Status and monitoring. 56

Google Cloud Computing GFS (Google File System): Large Scale Distributed File System Paper: Interpreting the Data: Parallel Analysis with Sawzall (2005) Paper: MapReduce: Simplified Data Processing

57 Google Cloud Computing GFS (Google File System): Large Scale Distributed File System Paper: Interpreting the Data: Parallel Analysis with Sawzall (2005) Paper: MapReduce: Simplified Data Processing on Large Clusters (2004) White paper: The Datacenter as a Computer. An Introduction to the Design of Warehouse-Scale Machines (2009) Web Search Programming Language Sawzall Parallel Programming Model MapReduce Distributed Database BigTable Distributed File System Google File System (GFS) Distributed lock System Application Service Log Analysis Chubby Gmail Datacenter Construction Google s platform Google Maps Paxos Paper: Bigtable: A Distributed Storage System for Structured Data (2006) Paper: The Google File System (2003) Paper: The Chubby lock service for looselycoupled distributed systems (2006) Paper: Failure Trends in a Large Disk Drive Population (2007) 57

58 Google Cloud Computing Google Technologies vs Open Source Technologies MapReduce BigTable Hadoop MapReduce HBase Google File System Hadoop Distributed File System Google s Technologies Open Source Technologies 58

59 Motivation: GFS Google needs a file system supporting storing massive data. Buy one (Probably including both software and hardware) Expensive! 59

60 Motivation: GFS Why not using the existing file system [2003]? Examples: RedHat GFS, IBM GPFS, Sun Lustre etc. The problem is different: Different workload Huge files (100s of GB to TB) Data is rarely updated in place Reads and appends are common Running on commodity hardware. Compatible with Google s service. 60

61 Whether could we build a file system which runs on commodity hardware? 61

62 Motivation: GFS Design overview: The system is built from many inexpensive commodity components that often fail. The system stores a modest number of large files. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. The workloads also have many large, sequential writes that append data to files. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. High sustained bandwidth is more important than low latency. 62

63 Google File System Design of GFS Client: Implement file system API, communicate with master and chunk servers. Master: A single master maintains all file system metadata. Chunk Server: store data chunks on local disk as linux files. 63

64 Google File System Minimize the master s involvement in all operations. Decouple the flow of data from the flow of control to use the network efficiently. A large chunk size: 64MB Reduces clients need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. A client is more likely to perform many operations on a large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Reduces the size of the metadata stored on the master. 64

65 Google File System Master operations: Namespace management and locking Replica placement By default, each chunk is replicated for 3 times. Creation, Re-replication, Rebalancing Garbage collection After a file is deleted, GFS does not immediately reclaim the available physical storage. Stale replica detection ( 过期的数据副本 ) The master removes stale replicas in its regular garbage collection. 65

66 HDFS The Apache Hadoop software library: A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. HDFS (Hadoop Distributed File System): A distributed file system that provides high-throughput access to application data. Written in Java programming language. 66

67 HDFS Namenode Corresponding to GFS s Master. Secondary Namenode Namenode s backup. Datanode Corresponding to GFS s Chunkserver. 67

68 Hadoop Hadoop Cluster 68

69 Heartbeat DataNode periodically NameNode 1. I am alive; 2. Blocks table. 69

70 HDFS Read RPC 70

71 HDFS Read Network topology and Hadoop 71

72 HDFS Write 72

HDFS Write HDFS replica placement A trade-off between reliability and write bandwidth and read bandwidth here. Default strategy: The first replica on the same node as the client.

73 HDFS Write HDFS replica placement A trade-off between reliability and write bandwidth and read bandwidth here. Default strategy: The first replica on the same node as the client. (random when client is out of the system) The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. 73

74 Mapreduce Jeff Dean 74

75 Motivation 75

76 Motivation Every search: 200+ CPU 200TB Data 0.1 second response time 5 revenue 76

77 Motivation Web data sets can be very large: Tens to hundreds of terabytes Cannot mine on a single server Data processing examples: Word Count Google Trends PageRank 77

78 Motivation Simple problem, difficult to solve: How to solve the problem within bounded time. Divide and Conquer! 78

79 Mapreduce Mapreduce: A programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of realworld tasks. Map: takes an input pair and produces a set of intermediate key/value pairs. Reduce: accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. 79

80 Mapreduce 80

81 Mapreduce 81

82 Example: WordCount Input: Page 1: the weather is good Page 2: today is good Page 3: good weather is good 82

83 Example: WordCount map output: Worker 1: (the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1). 83

84 Example: WordCount Input of Reduce: Worker 1: (the 1) Worker 2: (is 1), (is 1), (is 1) Worker 3: (weather 1), (weather 1) Worker 4: (today 1) Worker 5: (good 1), (good 1), (good 1), (good 1) 84

85 Example: WordCount Reduce output: Worker 1: (the 1) Worker 2: (is 3) Worker 3: (weather 2) Worker 4: (today 1) Worker 5: (good 4) 85

86 86

87 clear 87

88 88

89 Hadoop Mapreduce Programming Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Maps input key/value pairs to a set of intermediate key/value pairs. map() method: Called once for each key/value pair in the input split. Most applications should override this. 89

90 Hadoop Mapreduce Programming Context object: Allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output. Applications can use the Context: to report progress to set application-level status messages update Counters indicate they are alive to get the values that are stored in job configuration across map/reduce phase. 90

91 Hadoop Mapreduce Programming Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Reduces a set of intermediate values which share a key to a smaller set of values. Reducer has 3 primary phases: Shuffle: Copy the sorted output from each Mapper using HTTP across the network. Sort: Sort Reducer inputs by keys. Reduce reduce() method is called. 91

92 Hadoop Mapreduce Programming reduce() method: This method is called once for each key. Most applications will define their reduce class by overriding this method. 92

PaaS and Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University

PaaS and Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University PaaS and Hadoop Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University laiping@tju.edu.cn 1 Outline PaaS Hadoop: HDFS and Mapreduce YARN Single-Processor Scheduling Hadoop Scheduling