Cloud Computing. Lectures 11, 12 and 13 Cloud Storage


1 Cloud Computing Lectures 11, 12 and 13 Cloud Storage

2 Up until now Introduction Definition of Cloud Computing Grid Computing Content Distribution Networks Cycle-Sharing Distributed Scheduling Map Reduce 2

3 Outline Components of Cloud Platforms Storage Types Storage Products Cloud File Systems Cloud Object Storage 3

4 Components of Cloud Computing Platforms Programming Model How to program an application? How is the platform viewed? Monitoring Execution Model Data Storage Which abstraction is accessible: VM? API? Framework? Which operations can I perform? How are my data stored and accessed? Monitoring: How can I evaluate the state of executions/nodes/data...? 4

5 Major Cloud Platforms Apache Hadoop Amazon Web Services Google App Engine Microsoft Azure OpenStack 5

6 Storage Types A range of search, streaming and indexing variants. File System: Hierarchical organization, files, permission, streaming data,... Object Storage: Direct Program <-> Storage interaction Object ID indexing Tables (no-sql DB): records and tables Search No relational model Relational Databases: Full relational model Conventional services We will see that the categories are becoming blurred... 6

7 Storage Products (i) File System Hadoop File System / Google File System Object/Byte Storage Amazon S3 MS Azure Blobs Table Hadoop HBase / Google Big Table (AppEngine Datastore) Amazon Simple DB MS Azure Tables Hadoop Hive Yahoo PNUTS Relational Databases Amazon RDS SQL Azure 7

8 Cloud File System: HDFS/GFS Distributed File System Reimplementation of the Google File System (GFS). Runs on clusters of generic machines. HDFS is tuned for: Very large files. Streaming access. Generic hardware. Scalability key: data operations don't go through the central server. 8

9 Blocks Blocks simplify space management: allocation, replication, and letting a file grow almost indefinitely. Evolution: Disk blocks: 512 bytes File system blocks: 2, 4, 8 KB HDFS blocks: 64MB To minimize seeks: each block is a contiguous 64MB. A file smaller than one block does not occupy a full block, only the space it needs. 9

10 Namenode Manages the file system namespace: folder hierarchy, name uniqueness, ... Maintains the folder tree and the metadata in 2 files: the namespace image and the edit log. HDFS cannot operate without the namenode. Files can be written, read, renamed and deleted. It is not possible to: Write in the middle of a file. Write concurrently to the same file. Fault tolerance mechanism: atomic replication to another machine. 10

11 Datanode Manages a set of blocks. Processes clients' or the namenode's read/write requests. Periodically notifies the namenode of the blocks it holds. If a block's replication factor drops below a configured value, a new replica is created. 11

12 Permissions Permissions in HDFS are similar to UNIX: user, group and other; read, write and execute. As the user is very often remote, any username from a remote node is trusted. Therefore, protection is weak. Permissions are more geared towards managing a group of users in the cluster. 12

13 Consistency Model Formalization of the visibility of read and write operations: after an operation finishes, who sees what, and when? HDFS model: there are no guarantees that the last block has been written unless sync() is called. 13

14 Error Checking Block integrity is checked using a hashing function (CRC32 checksum). At file creation: the client calculates a checksum for each 512-byte chunk. The datanode stores the checksums. At file access: the client reads the data and the checksums from the datanode. If the check fails, it tries other replicas. Periodically, the datanode checks its blocks' checksums. 14
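The per-chunk checksum scheme above can be sketched as follows. This is a minimal illustration of the idea, not HDFS internals; the class and method names are made up for the example, and java.util.zip.CRC32 stands in for the actual checksum machinery.

```java
import java.util.zip.CRC32;

// Sketch of HDFS-style per-chunk checksumming: the writer computes one
// CRC32 per 512-byte chunk; the reader recomputes and compares.
public class ChunkChecksum {
    static final int CHUNK = 512;

    // Compute one CRC32 per 512-byte chunk of the data.
    public static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int len = Math.min(CHUNK, data.length - i * CHUNK);
            crc.update(data, i * CHUNK, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read: recompute and compare. A mismatch means this replica is
    // corrupt and the client should try another one.
    public static boolean verify(byte[] data, long[] stored) {
        long[] fresh = checksums(data);
        if (fresh.length != stored.length) return false;
        for (int i = 0; i < fresh.length; i++)
            if (fresh[i] != stored[i]) return false;
        return true;
    }
}
```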

15 Reading The client contacts the namenode to get the list of the datanodes with the file's blocks (stored in memory). It receives a FSDataInputStream that transparently chooses the best datanode, opens and closes connections to the datanodes, requests block locations from the namenode, repeats operations if necessary, and logs failed datanodes. 15

16 Reading 16

17 Choosing Nodes: Distance Nodes choose the closest sources of data. Assumes a tree-structured organization. Distance equals the number of hops between the tree nodes. distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node) distance(/d1/r1/n1, /d1/r1/n2) = 2 (processes on the same rack) distance(/d1/r1/n1, /d1/r2/n3) = 4 (processes on different racks) distance(/d1/r1/n1, /d2/r3/n4) = 6 (processes in different datacentres) 17
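The hop-counting rule above can be implemented in a few lines: walk both /datacentre/rack/node paths down to their closest common ancestor, then add the remaining depths. A minimal sketch (the class name is illustrative, not Hadoop's actual API):

```java
// Distance = number of hops from each node up to the closest common
// ancestor in the /datacentre/rack/node tree.
public class NetworkDistance {
    public static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        // Length of the shared path prefix (the common ancestor's depth).
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common]))
            common++;
        // Hops up from a plus hops down to b.
        return (pa.length - common) + (pb.length - common);
    }
}
```

Running it on the slide's four cases reproduces the distances 0, 2, 4 and 6.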

18 Distance Between Nodes 18

19 Writing (+ creating) The client asks the namenode to create a new file; the namenode checks permissions and name uniqueness. If it succeeds, the client receives a FSOutputStream. The namenode provides a set of datanodes for replication. Block write requests are kept in a data queue. Unconfirmed block write requests are kept in an ack queue. 19

20 Writing 20

21 Writing In case the datanode fails, the client changes the block id so that the corrupted replica is deleted later. By default, if one of the replicas is successfully written, the writing is considered done. The other replicas are written asynchronously. 21

22 Command Line Tool hadoop fs: -ls, -mkdir, -rm, -rmr, -put, -copyToLocal, -copyFromLocal 22

23 Cloud Object Store: Amazon Simple Storage System (S3) 23

24 S3 Amazon's persistent object storage system. Implementation based on the Dynamo system (SOSP, 2007). Accessible using HTTP through 3 different protocols, e.g. REST and SOAP. 24

25 Dynamo: Intuition CAP Theorem: Consistency, Availability and Partition tolerance - pick two! At Amazon: availability = clients' trust, so it cannot be sacrificed. In large data centres there are going to be frequent faults: the possibility of a partition has to be accounted for. Most data services tolerate small inconsistencies: relaxed consistency ==> eventual consistency. 25

26 Consistency Models Strong Consistency: once a write operation is finished for the requester, any subsequent read will return the value that was written. Weak Consistency: the system does not guarantee that subsequent accesses return the written value. Some condition must be verified for the written value to be returned (a time interval, an access to a synchronization variable, ...). The period between the write finishing and the value becoming visible is called the inconsistency window. Eventual Consistency: the system guarantees that, if there are no more writes, the updates will become visible to all clients (e.g. DNS): a DNS name update is propagated between zones until all clients see the new value. 26

27 Variants of Eventual Consistency Causal Consistency: two causally related writes (A happens before B) cannot lead to B being visible before A. There are no guarantees regarding write operations that are not causally related. Read-your-writes Consistency: every time a process writes a value, all its subsequent reads must reflect that write (a particular case of causal consistency). Session Consistency: a practical implementation of the previous model. All operations are done in the context of a session. During the session, the system guarantees read-your-writes. In the case of certain faults, the session is ended and the read-your-writes guarantee restarts. Monotonic Reads Consistency: if a process has seen a value, subsequent reads will never return a previous value. Monotonic Writes Consistency: the system serializes writes by the same process. Systems that do not guarantee this are very rare. 27

28 Dynamo Assumptions Interaction Model: whole-object reads and writes, keyed by unique IDs. Binary objects of up to 5GB. No operations spanning multiple objects. ACID properties (Atomicity, Consistency, Isolation, Durability): Atomicity/Isolation: whole-object writes. Durability: replicated writes. Only the consistency isn't strong. Efficiency: optimize for the 99.9th percentile. 28

29 Design Decisions Incremental scalability: adding nodes has to be simple. Load balancing and support for heterogeneity: the system must distribute the requests and support nodes with different characteristics. Solution: nodes in a Chord-like DHT. 29

30 Design Decisions Symmetry: All nodes are equally responsible peers. Decentralization: Avoid single points of failure. 30

31 Dynamo: Design Decisions (Problem / Technique / Advantage) Partitioning / consistent hashing / incremental scalability. Write availability / vector clocks with conflict resolution during reads / version size does not depend on the update rate. Temporary faults / sloppy quorum and hinted handoff / high availability and durability. Permanent faults / anti-entropy with Merkle trees / synchronizes replicas asynchronously. Membership and fault detection / gossip-based membership protocol / maintains symmetry and avoids a centralized directory. 31

32 Dynamo: API Two operations: put(key, context, object) key: object ID. context: vector clocks and the object's history. object: data to be written. get(key) 32

33 Partitioning and Replication Uses consistent hashing. Similar to Chord: each node has an id in the key space. Nodes are arranged in a ring. Data are stored in the node with the lowest key that is larger than the object's key. Replication: all objects are replicated in the N nodes that follow the node associated with the object. 33
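The ring placement rule above can be sketched with a sorted map: store each node at its hash, and for a key walk clockwise from the key's hash to collect the coordinator plus its N-1 distinct successors. This is a minimal illustration, not Dynamo's implementation; hash() here is a toy stand-in for the real MD5-based hash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Minimal consistent-hashing ring: an object lives on the first node at
// or after its key on the ring, replicated on the following N-1 nodes.
public class Ring {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    static int hash(String s) { return s.hashCode() & 0x7fffffff; }

    public void addNode(String node) { ring.put(hash(node), node); }

    // The coordinator plus its successors: N replicas for the key.
    public List<String> preferenceList(String key, int n) {
        List<String> nodes = new ArrayList<>();
        // Walk clockwise starting at the key's position...
        for (String node : ring.tailMap(hash(key)).values()) {
            if (nodes.size() == n) break;
            nodes.add(node);
        }
        // ...and wrap around the ring if we ran off the end.
        for (String node : ring.values()) {
            if (nodes.size() == n) break;
            if (!nodes.contains(node)) nodes.add(node);
        }
        return nodes;
    }
}
```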

34 The Chord Ring with Replication 34

35 Virtual Nodes Problem: few nodes or heterogeneous nodes lead to bad load balancing. Dynamo solution: use virtual nodes. Each physical node has several virtual node tickets. More powerful machines can have more tickets. Virtual node tickets are distributed randomly over the ring. 35
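One way to sketch the ticket idea: hash each ticket separately, so a physical node owns as many points on the ring as it has tickets, and more powerful machines cover more of the keyspace. Again illustrative only; the class name and the toy hash are assumptions, not Dynamo code.

```java
import java.util.Map;
import java.util.TreeMap;

// Virtual nodes: every ticket is a separate point on the ring that maps
// back to its physical node, so capacity is proportional to tickets.
public class VirtualRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    static int hash(String s) { return s.hashCode() & 0x7fffffff; }

    public void addNode(String node, int tickets) {
        for (int i = 0; i < tickets; i++)
            ring.put(hash(node + "#" + i), node); // one vnode per ticket
    }

    // Physical node responsible for the key: first vnode at or after it.
    public String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap
    }
}
```

Removing a physical node means removing only its tickets, so the keys it held are redistributed among all remaining nodes rather than dumped on one successor.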

36 Data Versions Nodes for writing and reading are selected based on load. So, we have eventual consistency: there may be different versions written on different replicas. Conflict resolution is performed when reading, not when writing. Syntactic reconciliation: some changes can be merged automatically, for formats with clearly identifiable parts and operations (e.g. a mail file). Semantic reconciliation: the user must decide. Divergence is uncommon; over all read operations, 99.94% saw 1 version, and the remaining cases (tiny fractions of a percent) saw 2, 3 or 4 versions. Timeout: after a number of generations without writing, old versions are discarded. 36

37 Vector Clocks (i) Represent time in a distributed system without clock synchronization, replacing physical time with causality. A vector clock is a list of (node, counter) pairs. If every entry of event A's vector clock is less than or equal to the corresponding entry of B's, and at least one is strictly smaller, then A happened before B: there is a causal chain of events from A to B. 37
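The comparison rule above is short enough to code directly: A happened before B iff no counter in A exceeds the corresponding counter in B and at least one is strictly smaller; clocks ordered in neither direction are concurrent, i.e. divergent replicas needing reconciliation. A minimal sketch with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal vector clock: a map node -> counter. Missing entries count as 0.
public class VectorClock {
    final Map<String, Integer> counters = new HashMap<>();

    // A node increments its own entry on every local event.
    public void tick(String node) {
        counters.merge(node, 1, Integer::sum);
    }

    // this happened-before other: all entries <=, at least one strictly <.
    public boolean happenedBefore(VectorClock other) {
        for (Map.Entry<String, Integer> e : counters.entrySet())
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0))
                return false;
        boolean strictly = false;
        for (Map.Entry<String, Integer> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0) < e.getValue())
                strictly = true;
        return strictly;
    }

    // Neither ordering holds: the versions are divergent.
    public boolean concurrentWith(VectorClock other) {
        return !happenedBefore(other) && !other.happenedBefore(this);
    }
}
```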

38 Vector Clocks (ii) [figure: vector clock example plotted against real time] 38

39 Object Versions If we assign a vector clock timestamp to all object versions we can detect divergent replicas. Example: X, Y and Z are servers with replicas of object D. D5 is a semantic reconciliation performed by the user. 39

40 Executing get() and put() For good performance, two possibilities: route requests through a load balancer that chooses the node based on load (creates a bottleneck), or use a client-side library to choose the node where to send the request, which becomes the coordinator (requires recompiling the client; probably irrelevant in AWS). The coordinator then executes the quorum reads or writes. 40

41 Read/Write Operations Dynamo supports writing and reading using a quorum model. This allows not waiting for all replicas on every operation. Consider R and W the numbers of read and write replicas that must synchronously take part in an operation. If R + W > N we have a quorum-based system: the set of replicas used for writing always overlaps with the set of read replicas, so it is impossible to read an object without seeing the latest written version. Latency is determined by the slowest node in the R (or W) set; therefore, to improve performance, one lowers R or W. 41
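The overlap claim above is just the pigeonhole principle: if R + W > N, an R-subset and a W-subset of N replicas cannot be disjoint. The sketch below checks the condition and, for small N, verifies it by brute force over all subsets (encoded as bitmasks); names are illustrative.

```java
public class Quorum {
    // R + W > N guarantees every read set intersects every write set.
    public static boolean overlapGuaranteed(int n, int r, int w) {
        return r + w > n;
    }

    // Brute-force check for small N: do all R-subsets and all W-subsets
    // of {0..n-1} intersect? Subsets are encoded as bitmasks.
    public static boolean allSetsIntersect(int n, int r, int w) {
        for (int a = 0; a < (1 << n); a++) {
            if (Integer.bitCount(a) != r) continue;
            for (int b = 0; b < (1 << n); b++) {
                if (Integer.bitCount(b) != w) continue;
                if ((a & b) == 0) return false; // disjoint read/write sets
            }
        }
        return true;
    }
}
```

For example, with N = 3, R = 2, W = 2 every pair of sets overlaps, while R = 1, W = 2 admits disjoint sets, matching the R + W > N test.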

42 Sloppy Quorum To ensure availability, Dynamo uses a sloppy quorum. Each data item is associated with a preference list of nodes spanning multiple machines and data centres. Operations are performed not necessarily on the N primary replicas but on the first N healthy nodes of the preference list. 42

43 Tolerating Temporary Faults: Hinted Handoff Assuming N = 3. If A is unavailable or fails when we write, send a replica to D. D marks the replica as temporary and returns the data to A as soon as it recovers. Replicas are chosen from a preference list of nodes. Preference lists always span multiple datacenters for fault tolerance. 43

44 Membership and Fault Detection Ring membership: at startup, use an external entry point to avoid partitioned rings. Gossip asynchronously to update the DHT: exchange membership lists with a random node every 2 seconds. Fault detection: faults are detected by neighbours using periodic messages with a timeout on the reply. 44

45 Permanent Faults When a hinted replica (one holding writes belonging to another replica) is considered failed: data is synchronized with the new replica using Merkle trees. 45

46 Merkle Trees Accelerate synchronization between nodes by comparing trees of hashes. Each tree node holds a hash of its children. This makes it very easy to identify what needs to be exchanged. The update can be asynchronous: an out-of-date tree is not serious. 46

47 Merkle Trees: Dynamo Each node has a set of keys. All objects are leaves of the Merkle tree. Replicas periodically exchange the top of the Merkle tree. If it is different, they recursively exchange the hashes of lower nodes. 47
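The tree construction can be sketched as follows: hash the data blocks at the leaves, then hash pairs of child hashes upwards until a single root remains. Equal roots mean the replicas agree; differing roots tell them to descend. This is a toy illustration, assuming a power-of-two number of blocks and a stand-in hash instead of a cryptographic one.

```java
import java.util.Arrays;

// Merkle-tree sketch: leaves hash the data blocks, each inner node hashes
// its two children, and tree comparison starts at the root.
public class Merkle {
    // Toy combiner, standing in for a cryptographic hash.
    static int hash(int a, int b) { return 31 * a + b; }

    static int leafHash(byte[] block) { return Arrays.hashCode(block); }

    // Build the tree bottom-up and return the root hash.
    // Assumes blocks.length is a power of two.
    public static int root(byte[][] blocks) {
        int[] level = new int[blocks.length];
        for (int i = 0; i < blocks.length; i++)
            level[i] = leafHash(blocks[i]);
        while (level.length > 1) {
            int[] up = new int[level.length / 2];
            for (int i = 0; i < up.length; i++)
                up[i] = hash(level[2 * i], level[2 * i + 1]);
            level = up;
        }
        return level[0];
    }
}
```

Comparing roots costs one exchange regardless of how many blocks the replicas hold; only differing subtrees are walked further.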

48 Back to S3 Additional issues when compared to Dynamo: access to S3 is controlled by an ACL based on the client's AWS identity, checked with their secret key. Occasionally, some S3 calls fail and must be repeated; programs accessing S3 should take this into account. S3 replication is performed between data centres; this large-scale replication has some lag. 48
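The "repeat failed calls" advice above is usually implemented as a retry wrapper with backoff. A generic sketch, not AWS SDK code; the class name and the backoff constants are arbitrary choices for the example, and it assumes at least one attempt.

```java
import java.util.concurrent.Callable;

// Retry wrapper for flaky remote calls such as S3 requests: try the call,
// and on failure wait with exponential backoff before trying again.
public class Retry {
    public static <T> T withRetries(Callable<T> call, int maxAttempts)
            throws Exception {
        long backoffMs = 100;            // arbitrary starting backoff
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                // assume the failure is transient
                Thread.sleep(backoffMs);
                backoffMs *= 2;          // exponential backoff
            }
        }
        throw last;                      // all attempts failed: give up
    }
}
```

Any S3 operation (a GET, a PUT) would then be passed in as the Callable instead of being invoked directly.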

49 Service Level Agreements Hosting contracts and cloud platforms, like S3, include SLAs. Very often these are described as averages, medians and/or variances of response times: extreme cases are always problematic. Amazon optimizes for 99.9% of the requests. Example: 300ms response time for 99.9% of the requests, below a peak request rate of 500 requests per second. 49

50 Buckets and Objects S3 data are stored as Dynamo objects. Operations on objects are: PUT, GET, DELETE, HEAD (get metadata) Objects can be grouped in buckets. Buckets are used for delimiting namespaces:

51 S3: REST GET
Sample Request:
GET /my-image.jpg HTTP/1.1
Host: bucket.s3.amazonaws.com
Date: Wed, 28 Oct :32:00 GMT
Authorization: AWS 02236Q3V0WHVSRW0EXG2:0RQf4/cRonhpaBX5sCYVf1bNRuU=
(See developer-guide/restauthentication.html)
Sample Response:
HTTP/1.1 200 OK
x-amz-id-2: eftixk72ad6ap51tnqcof8efidjg9z/2mkidfu8yu9as1ed4opiszj7udnehgran
x-amz-request-id: 318BC8BC148832E5
Date: Wed, 28 Oct :32:00 GMT
Last-Modified: Wed, 12 Oct :50:00 GMT
ETag: "fba9dede5f27731c a "
Content-Length:
Content-Type: text/plain
Connection: close
Server: AmazonS3
[ bytes of object data] 51

52 S3: REST PUT
Sample Request:
PUT /my-image.jpg HTTP/1.1
Host: mybucket.s3.amazonaws.com
Date: Wed, 12 Oct :50:00 GMT
Authorization: AWS 15B4D3461F A:xQE0diMbLRepdf3YB+FIEXAMPLE=
Content-Type: text/plain
Content-Length: 11434
Expect: 100-continue
[11434 bytes of object data]
Sample Response:
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
x-amz-id-2: LriYPLdmOdAiIfgSm/F1YsViT1LW94/xUQxMsF7xiEb1a0wiIOIxl+zbwZ163pt7
x-amz-request-id: 0A49CE EAC
x-amz-version-id: 43jfkodU8493jnFJD9fjj3HHNVfdsQUIFDNsidf038jfdsjGFDSIRp
Date: Wed, 12 Oct :50:00 GMT
ETag: "fbacf535f27731c a "
Content-Length: 0
Connection: close
Server: AmazonS3 52

53 S3: REST in Java
public void createBucket() throws Exception {
    // S3 timestamp pattern.
    String fmt = "EEE, dd MMM yyyy HH:mm:ss ";
    SimpleDateFormat df = new SimpleDateFormat(fmt, Locale.US);
    df.setTimeZone(TimeZone.getTimeZone("GMT"));
    // Data needed for signature
    String method = "PUT";
    String contentMD5 = "";
    String contentType = "";
    String date = df.format(new Date()) + "GMT";
    String bucket = "/onjava";
    // Generate signature
    StringBuffer buf = new StringBuffer();
    buf.append(method).append("\n");
    buf.append(contentMD5).append("\n");
    buf.append(contentType).append("\n");
    buf.append(date).append("\n");
    buf.append(bucket);
    String signature = sign(buf.toString());
    // Connection to s3.amazonaws.com
    HttpURLConnection httpConn = null;
    URL url = new URL("http", "s3.amazonaws.com", 80, bucket);
    httpConn = (HttpURLConnection) url.openConnection();
    httpConn.setDoInput(true);
    httpConn.setDoOutput(true);
    httpConn.setUseCaches(false);
    httpConn.setDefaultUseCaches(false);
    httpConn.setAllowUserInteraction(true);
    httpConn.setRequestMethod(method);
    httpConn.setRequestProperty("Date", date);
    httpConn.setRequestProperty("Content-Length", "0");
    String AWSAuth = "AWS " + keyId + ":" + signature;
    httpConn.setRequestProperty("Authorization", AWSAuth);
    // Send the HTTP PUT request.
    int statusCode = httpConn.getResponseCode();
    if ((statusCode / 100) != 2) {
        // Deal with S3 error stream.
        InputStream in = httpConn.getErrorStream();
        String errorStr = getS3ErrorCode(in);
    }
} 53

54 S3: REST in JetS3t
String awsAccessKey = "YOUR_AWS_ACCESS_KEY";
String awsSecretKey = "YOUR_AWS_SECRET_KEY";
AWSCredentials awsCredentials = new AWSCredentials(awsAccessKey, awsSecretKey);
S3Service s3Service = new RestS3Service(awsCredentials);
S3Bucket euBucket = s3Service.createBucket("eu-bucket", S3Bucket.LOCATION_EUROPE); 54

55 Windows Azure 55

56 Azure Storage (i) Volatile storage: Instance disk Memory cache Persistent Storage: Windows Azure Storage: Blobs (objects) Tables Queues SQL Azure: Relational DB 56

57 Azure Storage (ii) The service is accessible via Web Services or libraries on top of these (C#, VB, Java). Blobs, Tables and Queues are stored in partitions. Partitions are the replication and load balancing unit. Blobs and queues are not sharded; tables may be. All partitions have 3 replicas. Partitions are represented in a DFS as one or more extents (contiguous files) of up to 1GB. 57

58 Blobs A blob is a <name, object> pair. Allows storage of objects from a few bytes up to 50GB. Blobs are stored in containers. There is no hierarchy in blob storage, but one can be simulated because names may contain '/'. URL schema: http://<account>.blob.core.windows.net/<container>/<blobname> 58

59 Operations on Blobs Put: creating Get: reading Set: updating Delete: eliminating Lease: 1 minute locking. 59

60 Next Time... Storage in Cloud Platforms 60


More information

Haridimos Kondylakis Computer Science Department, University of Crete

Haridimos Kondylakis Computer Science Department, University of Crete CS-562 Advanced Topics in Databases Haridimos Kondylakis Computer Science Department, University of Crete QSX (LN2) 2 NoSQL NoSQL: Not Only SQL. User case of NoSQL? Massive write performance. Fast key

More information

SCALABLE CONSISTENCY AND TRANSACTION MODELS

SCALABLE CONSISTENCY AND TRANSACTION MODELS Data Management in the Cloud SCALABLE CONSISTENCY AND TRANSACTION MODELS 69 Brewer s Conjecture Three properties that are desirable and expected from realworld shared-data systems C: data consistency A:

More information

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

CS60021: Scalable Data Mining. Sourangshu Bhattacharya CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer

More information

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

! Design constraints.  Component failures are the norm.  Files are huge by traditional standards. ! POSIX-like Cloud background Google File System! Warehouse scale systems " 10K-100K nodes " 50MW (1 MW = 1,000 houses) " Power efficient! Located near cheap power! Passive cooling! Power Usage Effectiveness = Total

More information

Dynamo: Amazon s Highly Available Key-value Store

Dynamo: Amazon s Highly Available Key-value Store Dynamo: Amazon s Highly Available Key-value Store Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and

More information

ZooKeeper & Curator. CS 475, Spring 2018 Concurrent & Distributed Systems

ZooKeeper & Curator. CS 475, Spring 2018 Concurrent & Distributed Systems ZooKeeper & Curator CS 475, Spring 2018 Concurrent & Distributed Systems Review: Agreement In distributed systems, we have multiple nodes that need to all agree that some object has some state Examples:

More information

11/5/2018 Week 12-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University

11/5/2018 Week 12-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University 11/5/2018 CS435 Introduction to Big Data - FALL 2018 W12.A.0.0 CS435 Introduction to Big Data 11/5/2018 CS435 Introduction to Big Data - FALL 2018 W12.A.1 Consider a Graduate Degree in Computer Science

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Introduction to Distributed Data Systems

Introduction to Distributed Data Systems Introduction to Distributed Data Systems Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook January

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Distributed Systems. 16. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 16. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 16. Distributed Lookup Paul Krzyzanowski Rutgers University Fall 2017 1 Distributed Lookup Look up (key, value) Cooperating set of nodes Ideally: No central coordinator Some nodes can

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2015 Lecture 14 NoSQL References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No.

More information

6.830 Lecture Spark 11/15/2017

6.830 Lecture Spark 11/15/2017 6.830 Lecture 19 -- Spark 11/15/2017 Recap / finish dynamo Sloppy Quorum (healthy N) Dynamo authors don't think quorums are sufficient, for 2 reasons: - Decreased durability (want to write all data at

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system

More information

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment. Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client

More information

Google File System. By Dinesh Amatya

Google File System. By Dinesh Amatya Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영 1 Outline Introduction Design Overview System Interactions Master Operation Fault Tolerance and Diagnosis Conclusions

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems GFS (The Google File System) 1 Filesystems

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Distributed File Systems 15 319, spring 2010 12 th Lecture, Feb 18 th Majd F. Sakr Lecture Motivation Quick Refresher on Files and File Systems Understand the importance

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5. Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message

More information

Reprise: Stability under churn (Tapestry) A Simple lookup Test. Churn (Optional Bamboo paper last time)

Reprise: Stability under churn (Tapestry) A Simple lookup Test. Churn (Optional Bamboo paper last time) EECS 262a Advanced Topics in Computer Systems Lecture 22 Reprise: Stability under churn (Tapestry) P2P Storage: Dynamo November 20 th, 2013 John Kubiatowicz and Anthony D. Joseph Electrical Engineering

More information

Google File System 2

Google File System 2 Google File System 2 goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) focus on multi-gb files handle appends efficiently (no random writes & sequential reads) co-design

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access

More information

7680: Distributed Systems

7680: Distributed Systems Cristina Nita-Rotaru 7680: Distributed Systems GFS. HDFS Required Reading } Google File System. S, Ghemawat, H. Gobioff and S.-T. Leung. SOSP 2003. } http://hadoop.apache.org } A Novel Approach to Improving

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Distributed System. Gang Wu. Spring,2018

Distributed System. Gang Wu. Spring,2018 Distributed System Gang Wu Spring,2018 Lecture7:DFS What is DFS? A method of storing and accessing files base in a client/server architecture. A distributed file system is a client/server-based application

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Dynamo Tom Anderson and Doug Woos

Dynamo Tom Anderson and Doug Woos Dynamo motivation Dynamo Tom Anderson and Doug Woos Fast, available writes - Shopping cart: always enable purchases FLP: consistency and progress at odds - Paxos: must communicate with a quorum Performance:

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 26 File Systems Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Cylinders: all the platters?

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

CA485 Ray Walshe NoSQL

CA485 Ray Walshe NoSQL NoSQL BASE vs ACID Summary Traditional relational database management systems (RDBMS) do not scale because they adhere to ACID. A strong movement within cloud computing is to utilize non-traditional data

More information

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014 Cassandra @ Spotify Scaling storage to million of users world wide! Jimmy Mårdell October 14, 2014 2 About me Jimmy Mårdell Tech Product Owner in the Cassandra team 4 years at Spotify

More information

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store Oracle NoSQL Database A Distributed Key-Value Store Charles Lamb The following is intended to outline our general product direction. It is intended for information purposes only,

More information

NoSQL Concepts, Techniques & Systems Part 1. Valentina Ivanova IDA, Linköping University

NoSQL Concepts, Techniques & Systems Part 1. Valentina Ivanova IDA, Linköping University NoSQL Concepts, Techniques & Systems Part 1 Valentina Ivanova IDA, Linköping University 2017-03-20 2 Outline Today Part 1 RDBMS NoSQL NewSQL DBMS OLAP vs OLTP NoSQL Concepts and Techniques Horizontal scalability

More information

Distributed Systems. GFS / HDFS / Spanner

Distributed Systems. GFS / HDFS / Spanner 15-440 Distributed Systems GFS / HDFS / Spanner Agenda Google File System (GFS) Hadoop Distributed File System (HDFS) Distributed File Systems Replication Spanner Distributed Database System Paxos Replication

More information

Performance and Forgiveness. June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences

Performance and Forgiveness. June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences Performance and Forgiveness June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences Margo Seltzer Architect Outline A consistency primer Techniques and costs of consistency

More information

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Goal of the presentation is to give an introduction of NoSQL databases, why they are there. 1 Goal of the presentation is to give an introduction of NoSQL databases, why they are there. We want to present "Why?" first to explain the need of something like "NoSQL" and then in "What?" we go in

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju Chapter 4: Distributed Systems: Replication and Consistency Fall 2013 Jussi Kangasharju Chapter Outline n Replication n Consistency models n Distribution protocols n Consistency protocols 2 Data Replication

More information

Consistency in Distributed Storage Systems. Mihir Nanavati March 4 th, 2016

Consistency in Distributed Storage Systems. Mihir Nanavati March 4 th, 2016 Consistency in Distributed Storage Systems Mihir Nanavati March 4 th, 2016 Today Overview of distributed storage systems CAP Theorem About Me Virtualization/Containers, CPU microarchitectures/caches, Network

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction...3 2 Assumptions and Goals...3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets...3 2.4 Simple Coherency Model... 4 2.5

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Large-Scale Data Stores and Probabilistic Protocols

Large-Scale Data Stores and Probabilistic Protocols Distributed Systems 600.437 Large-Scale Data Stores & Probabilistic Protocols Department of Computer Science The Johns Hopkins University 1 Large-Scale Data Stores and Probabilistic Protocols Lecture 11

More information

Data Management in the Cloud. Tim Kraska

Data Management in the Cloud. Tim Kraska Data Management in the Cloud Tim Kraska Montag, 22. Februar 2010 Systems Group/ETH Zurich MILK? [Anology from IM 2/09 / Daniel Abadi] 22.02.2010 Systems Group/ETH Zurich 2 Do you want milk? Buy a cow High

More information

Google File System. Arun Sundaram Operating Systems

Google File System. Arun Sundaram Operating Systems Arun Sundaram Operating Systems 1 Assumptions GFS built with commodity hardware GFS stores a modest number of large files A few million files, each typically 100MB or larger (Multi-GB files are common)

More information

Google File System, Replication. Amin Vahdat CSE 123b May 23, 2006

Google File System, Replication. Amin Vahdat CSE 123b May 23, 2006 Google File System, Replication Amin Vahdat CSE 123b May 23, 2006 Annoucements Third assignment available today Due date June 9, 5 pm Final exam, June 14, 11:30-2:30 Google File System (thanks to Mahesh

More information

Extreme Computing. NoSQL.

Extreme Computing. NoSQL. Extreme Computing NoSQL PREVIOUSLY: BATCH Query most/all data Results Eventually NOW: ON DEMAND Single Data Points Latency Matters One problem, three ideas We want to keep track of mutable state in a scalable

More information

Staggeringly Large File Systems. Presented by Haoyan Geng

Staggeringly Large File Systems. Presented by Haoyan Geng Staggeringly Large File Systems Presented by Haoyan Geng Large-scale File Systems How Large? Google s file system in 2009 (Jeff Dean, LADIS 09) - 200+ clusters - Thousands of machines per cluster - Pools

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [REPLICATION & CONSISTENCY] Frequently asked questions from the previous class survey Shrideep Pallickara Computer Science Colorado State University L25.1 L25.2 Topics covered

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information