Cloud Computing
Lectures 15 and 16: Cloud Storage
2011-2012

Up until now
- Introduction
- Definition of Cloud Computing
- Grid Computing
- Content Distribution Networks
- Cycle-Sharing
- Distributed Scheduling
- Map Reduce
- Cloud Storage
- File Systems
- Object Storage

Outline
- Table Storage Services:
  - Google BigTable
  - Hadoop HBase
  - Amazon SimpleDB
  - Microsoft Azure Tables
- Messaging Services:
  - Amazon SQS
  - Google App Engine Queues
  - Microsoft Azure Queues
- Relational Database Services:
  - Amazon RDS
  - SQL Azure

Table Storage Services

Motivation: Why Simple Table Storage?
- The concept of a record (a row) is essential for most object representations.
- An indexed data structure is essential for queries.
- The data model matches database concepts well.
- ACID transactions on a conventional relational model do not scale.

Google BigTable
- Google's table storage service.
- Used, e.g., for crawling the web.
- Google Datastore is based on BigTable.

Data Model
- A table has entities (records, rows, ...).
- A table has column families:
  - Created statically (e.g. family = "anchor").
  - A column family is like a property.
- Each column family has columns:
  - Created dynamically (e.g. column = "anchor:cnnsi.com").
  - A column is an instance of a property.
- Each column is timestamped:
  - An entity can have several versions with different timestamps.
  - Versions are stored in reverse chronological order.

Entities
- Entities are ordered alphabetically (lexicographically) by key.
- Entities can be read, written, deleted, scanned (key prefix selection) and range scanned (key range selection).
- ACID transactions only on a single entity.
- Sequences of entities are stored in tablets.
- The choice of key determines which entities are nearby in the tablet!
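
A minimal sketch of the data model above, assuming a toy in-memory table. The class TinyTable and its methods are hypothetical names chosen for illustration; this is not the BigTable API. It shows statically declared column families, dynamically created columns, timestamped versions kept newest first, and a key prefix scan over ordered row keys.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory model of the data model above (hypothetical, not BigTable)."""

    def __init__(self, families):
        # Column families are declared up front (statically).
        self.families = set(families)
        # row key -> column name -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        # Columns ("family:qualifier") are created dynamically on first write.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        versions = self.rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # reverse chronological

    def get(self, row, column):
        """Return the most recent value of a column, or None if absent."""
        versions = self.rows.get(row, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan_prefix(self, prefix):
        """Return the row keys that start with prefix, in key order."""
        return [key for key in sorted(self.rows) if key.startswith(prefix)]
```

Because row keys are kept in order, reversed host names such as "com.cnn.www" and "com.cnn.money" end up next to each other, which is the point of the last bullet of the Entities slide.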

Example: Writing, Reading and Iterating

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.cspan.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);

    // Read every version of every anchor of one row and iterate over them
    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(),
             stream->ColumnName(),
             stream->MicroTimestamp(),
             stream->Value());
    }

Tablets
- Each tablet contains a contiguous sequence of entities.
- Tablets are stored in SSTable files within GFS (or HDFS in Hadoop).
- SSTable file format: an index of <key, value> pairs + 64 KB data blocks.
- [Figure: an SSTable made of 64 KB data blocks plus an index.]
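
A rough sketch of how an index over data blocks can serve a lookup, in the spirit of the SSTable layout above. It is an illustration under simplifying assumptions, not the real file format: blocks hold a fixed number of pairs instead of 64 KB, keys are unique, everything lives in memory, and the names build_sstable and lookup are hypothetical.

```python
import bisect


def build_sstable(pairs, pairs_per_block=2):
    """Split sorted (key, value) pairs into blocks and index each block's first key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    blocks = [ordered[i:i + pairs_per_block]
              for i in range(0, len(ordered), pairs_per_block)]
    index = [block[0][0] for block in blocks]
    return index, blocks


def lookup(index, blocks, key):
    """Use the index to pick the single block that may hold key, then search it."""
    position = bisect.bisect_right(index, key) - 1
    if position < 0:
        return None  # key sorts before the first block
    for candidate, value in blocks[position]:
        if candidate == key:
            return value
    return None
```

Only one block is read per lookup, which is why the index is kept small enough to stay in memory.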

Components
- The client application uses a library to interact with the service.
- Master: performs operations on:
  - Tables: creation.
  - Column families: creation.
  - Tablets: allocation, elimination, load balancing among tablet servers.
- Tablet server: performs reading, writing and splitting of tablets that are too large (> 200 MB).

Metadata: Where are the tablets?
- BigTable uses Chubby, a locking and small-file repository, for:
  - Locating BigTable's metadata.
  - Checking whether tablet servers are alive.
- Chubby uses five replicated servers that elect a master with the Paxos algorithm.
- Each update on Chubby (insertion, deletion, ...) is agreed upon by the Chubby nodes with Paxos.

Primary Election
- Distributed consensus problem.
- Asynchronous communication causes message loss, delay and reordering.
- FLP (Fischer-Lynch-Paterson) impossibility result: in an asynchronous system with no common timing, deterministic consensus cannot be guaranteed if even one process may fail.
- Solution: Paxos protocol.

Paxos: Problem
- A collection of processes proposes values.
- Only a proposed value may be chosen.
- Only a single value is chosen.
- A process learns that a value was chosen only when it has actually been chosen (voted).
- Nodes can be proposers, acceptors and learners.
- Asynchronous, non-Byzantine model:
  - Processes run at arbitrary speeds, fail by stopping and may restart.
  - Messages are not corrupted.
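
Paxos: Algorithm
- Phase 1:
  (a) The proposer sends a prepare request with proposal number n.
  (b) Acceptor: if n is greater than the number of any other prepare it has replied to, it responds with a promise (and reports the highest-numbered proposal it has already accepted, if any).
- Phase 2:
  (a) If a majority replies, the proposer sends an accept request with number n and value v (the value of the highest-numbered proposal reported in the replies, or its own value if none was reported).
  (b) The acceptor accepts unless it has responded to a prepare with a number higher than n.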
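
The two phases can be sketched as follows. This is a minimal, single-process illustration of single-value Paxos, assuming direct method calls instead of messages (so no loss, delay or reordering) and no persistence; the class Acceptor and the function propose are hypothetical names.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest prepare number this acceptor replied to
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, n):
        # Phase 1(b): promise only if n is higher than any prepare seen so far.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        # Phase 2(b): accept unless a higher-numbered prepare was answered.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases for proposal n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1(a): send prepare(n) to every acceptor.
    replies = [acceptor.prepare(n) for acceptor in acceptors]
    promises = [accepted for kind, accepted in replies if kind == "promise"]
    if len(promises) < majority:
        return None
    # A proposer must adopt the value of the highest-numbered accepted proposal.
    earlier = [accepted for accepted in promises if accepted is not None]
    if earlier:
        value = max(earlier, key=lambda proposal: proposal[0])[1]
    # Phase 2(a): send accept(n, value) and count the votes.
    votes = sum(1 for acceptor in acceptors if acceptor.accept(n, value))
    return value if votes >= majority else None
```

Once a value has been chosen, a later proposal with a higher number ends up re-proposing that same value, which is how the "only a single value is chosen" property is preserved.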

Paxos: State Machines
- Replicated state machine:
  - Replicas reach the same state if they perform the same sequence of operations.
- The client sends requests to a server:
  - The server is replicated with Paxos.
- Paxos is used to agree on the order of client operations:
  - There can be failures, or more than one master.
  - Paxos guarantees that only one value is chosen and replicated.

Metadata
- Each metadata tablet has up to 128 MB.
- Clients maintain a cache of tablet locations.

Writing
- Writing is done on two tablet server structures:
  - A redo log (one per server), and
  - A temporary in-memory table, the memtable.
- Compactions of the memtable into SSTables:
  - Minor: when the memtable reaches a certain size, it is converted into an SSTable. Reduces log and memory occupation.
  - Merging: merges the SSTables produced by minor compactions.
  - Major: reduces the SSTables to a minimum (a single SSTable), filtering out removed entities.

Fault Tolerance
- At startup, the master compares the list of live servers and the tablets they claim to hold with the metadata, and updates the tablet distribution.
- Every time a tablet server dies, the master is notified.
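
The write path from the Writing slide can be illustrated with a toy model. This is a sketch under stated assumptions: the memtable limit is counted in entries rather than bytes, SSTables are plain sorted lists in memory, and the class ToyTabletServer and the TOMBSTONE marker are hypothetical names.

```python
TOMBSTONE = object()  # marker written in place of a value when an entity is deleted


class ToyTabletServer:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.log = []        # redo log of writes not yet stored in an SSTable
        self.memtable = {}   # recent writes, kept in memory
        self.sstables = []   # immutable sorted lists of (key, value), oldest first

    def write(self, key, value):
        self.log.append((key, value))  # log first, then memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def minor_compaction(self):
        # Freeze the memtable into an SSTable; log and memory can then shrink.
        self.sstables.append(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.memtable = {}
        self.log = []

    def major_compaction(self):
        # Rewrite every SSTable into one, dropping deleted entities.
        merged = {}
        for table in self.sstables:  # oldest first, so newer values win
            merged.update(table)
        live = [(k, v) for k, v in sorted(merged.items(), key=lambda kv: kv[0])
                if v is not TOMBSTONE]
        self.sstables = [live]

    def read(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for table in reversed(self.sstables):  # newest SSTable first
            for candidate, value in table:
                if candidate == key:
                    return None if value is TOMBSTONE else value
        return None
```

A read has to consult the memtable and then the SSTables from newest to oldest, which is why merging and major compactions matter: they bound the number of files a read may touch.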

Use at Google (2006)

Datastore: BigTable for Programmers
- Object storage over BigTable:
  - Object = entity.
- Each entity is accessible by a single application.
- Each entity has a system-wide unique id that includes the app id and the kind id.
  - The kind is a namespace for entities.
  - Each entity has an app-wide id.
- Data are stored in the entity's columns.
  - Columns have a name and a value.

Datastore: Example

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Create an entity of kind "Book" and store it
    Entity book = new Entity("Book");
    book.setProperty("title", "The Grapes of Wrath");
    book.setProperty("author", "John Steinbeck");
    ds.put(book);

    // Delete the entity by its key
    ds.delete(book.getKey());

Datastore: DB-like support
- An intermediate layer (Megastore) executes queries, builds indexes and provides multi-record transactional support.
- Indexes:
  - Kind index (kind, key): object index, with one kind per key.
  - Single-property index (kind, name, value, key): kind/column/key indexes that are created on demand (in both ascending and descending versions).
  - Composite index: defined by the user in datastore-indexes.xml, created before running the app.

Transactions
- BigTable supports only transactions on one entity.
- Datastore supports transactions on several entities.
- However:
  - Transactions can only operate on one server (same entity group).
  - No distributed transactions.
  - The usual relational DB scalability issues.
- Datastore transactions:
  - Use optimistic concurrency control.
  - Cannot be nested (no sub-transactions).
  - Are cancelled at the end of the servlet (GAE is web-app oriented) if they are not confirmed.

Transactions: Example

    try {
        Transaction txn = ds.beginTransaction();
        Key boardKey;
        Entity messageBoard;
        try {
            // Look up the board; create it if it does not exist yet
            boardKey = KeyFactory.createKey("MessageBoard", boardName);
            messageBoard = ds.get(boardKey);
        } catch (EntityNotFoundException e) {
            messageBoard = new Entity("MessageBoard", boardName);
            messageBoard.setProperty("count", 0);
            boardKey = ds.put(messageBoard);
        }
        txn.commit(); // or txn.rollback()
    } catch (DatastoreFailureException e) {
        // Report an error...
    }
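
Optimistic concurrency control, as mentioned on the Transactions slide, can be sketched with a version number per entity. This is a generic illustration, not how Datastore implements it; the class VersionedStore and the function increment are hypothetical names.

```python
class VersionedStore:
    """Toy store where each key carries a version checked at commit time."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def commit(self, key, read_version, new_value):
        """Write only if nobody committed since read_version was read."""
        current_version, _ = self.read(key)
        if current_version != read_version:
            return False  # conflict: the caller must re-read and retry
        self.data[key] = (current_version + 1, new_value)
        return True


def increment(store, key, retries=3):
    """Read-modify-write loop that retries when a conflict is detected."""
    for _ in range(retries):
        version, value = store.read(key)
        if store.commit(key, version, (value or 0) + 1):
            return True
    return False
```

No lock is held between the read and the commit, so readers never block; the price is that a transaction may have to be retried when two writers touch the same entity.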

Google Query Language
- Simplified database query language for Google Datastore.
- All queries are transformed into scan operations on index BigTables.
- A small subset of SQL.
  - Only supports SELECT.
  - Has no INSERT, UPDATE or DELETE.

Google Query Language: Syntax

    SELECT [* | __key__]
      FROM <kind>
      [WHERE <condition> [AND <condition> ...]]
      [ORDER BY <property> [ASC | DESC] [, <property> [ASC | DESC] ...]]
      [LIMIT [<offset>,]<count>]
      [OFFSET <offset>]

    <condition> := <property> {< | <= | > | >= | = | !=} <value>
    <condition> := <property> IN <list>
    <condition> := ANCESTOR IS <entity or key>
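
To illustrate how a query can be turned into a scan over an index, here is a small sketch that answers a range condition on one property with a sorted index and two binary searches. It is an assumption-laden toy (in-memory list, one kind, comparable values, hypothetical function names), not the Datastore implementation.

```python
import bisect


def build_index(entities, prop):
    """Build a single-property index: a sorted list of (value, key) entries."""
    return sorted((props[prop], key)
                  for key, props in entities.items() if prop in props)


def scan_range(index, low, high):
    """Return the keys whose indexed value v satisfies low <= v < high."""
    start = bisect.bisect_left(index, (low,))
    end = bisect.bisect_left(index, (high,))
    return [key for _, key in index[start:end]]


# Hypothetical data for entities of kind Book
books = {
    "book1": {"year": 1939},
    "book2": {"year": 1952},
    "book3": {"year": 1962},
}
year_index = build_index(books, "year")
# Roughly: SELECT __key__ FROM Book WHERE year >= 1939 AND year < 1960
matching_keys = scan_range(year_index, 1939, 1960)
```

The matching entries are contiguous in the index, so the query is a single range scan; this is also why queries that cannot be mapped to one contiguous range need a composite index or are not supported.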

SimpleDB
- Hierarchical data storage.
  - Based on S3.
  - Adds multiple attributes, indexing and queries.
- Ad hoc data storage:
  - No system administration cost.
  - No schema.
  - Automatically scalable.
- Efficient for data where read operations dominate, due to eventual consistency.
  - If conflicts are too common, the system becomes inefficient.
  - E.g. forums, metadata, backups.

SimpleDB: Data Model
- Domains:
  - Equivalent to a table.
  - Identified by a string.
  - 100 per account.
  - 10 GB per domain.
- Items:
  - Identified by a string.
  - Unlimited number per domain.
- Attributes:
  - <key, value> pairs.
  - No types, just strings.
  - Automatically indexed.
  - 256 per item.
  - 250 million per domain.
  - 1 KB per attribute.

SimpleDB: Missing Features
- No transactions.
- No notifications.
  - May be compensated by a messaging service like SQS.
- No ordering:
  - Must be done at the client.
- No joins.
- No types:
  - Only string comparisons.
  - Care must be taken to ensure that comparisons are accurate, e.g. pad numbers with leading zeros ("001", "002", "003" and not "1", "2", "3").
- Does not store "bags of bytes": only 1 KB.
  - For large objects, S3 must be used directly.

SimpleDB: Queries
- Select: allows querying the domain:
    select target from domain_name where query_expression
- Supported operators: =, !=, <, >, <=, >=, like, not like, between, is null, is not null.
- Example:
    select * from mydomain where keyword = 'Book'
- Supports ordering of the results (order by).
- And counting using count(*).
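
The zero-padding advice from the Missing Features slide can be checked with a few lines. The function name encode_number and the default width are assumptions for illustration; the width must be large enough for the biggest value the application stores, and negative numbers need a separate offset scheme that is not shown here.

```python
def encode_number(n, width=10):
    """Encode a non-negative integer so that string order equals numeric order."""
    if n < 0:
        raise ValueError("negative numbers need an offset before padding")
    text = str(n)
    if len(text) > width:
        raise ValueError("number does not fit in the chosen width")
    return text.zfill(width)


# Plain strings compare character by character, so "10" sorts before "9".
unpadded = sorted(["9", "10", "2"])
# Padded strings sort in numeric order.
padded = sorted(encode_number(n, width=3) for n in [9, 10, 2])
```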

Windows Azure Tables
- Similar to BigTable.
- Storage hierarchy:
  - Table -> Entity -> Property -> <name, type, value>
- Supported types: Binary, Bool, DateTime, Double, GUID, Int, Int64 and String.
- URL schema:
    http://<storageaccount>.table.core.windows.net/<tablename>?$filter=<query>
- Optimistic concurrency control.
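
As a small illustration of the URL schema above, the following sketch fills in the account, the table and a percent-encoded filter. The function name table_query_url is hypothetical, and a real request would also need authentication headers, which are omitted.

```python
from urllib.parse import quote


def table_query_url(account, table, filter_expression):
    """Build a query URL following the schema shown above."""
    # Spaces become %20; single quotes around string literals are left as-is.
    encoded = quote(filter_expression, safe="'")
    return "http://{0}.table.core.windows.net/{1}?$filter={2}".format(
        account, table, encoded)


url = table_query_url("myaccount", "customers",
                      "lastname eq 'smith' and firstname eq 'john'")
```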

Operations on Tables
- On tables:
  - Create, Delete, Query.
  - Example (GET using REST):
      http://myaccount.table.core.windows.net/customers()?$filter=lastname%20eq%20'smith'%20and%20firstname%20eq%20'john'
- On entities:
  - Insert, Update, Merge, Delete.
- It is possible to perform transactions by grouping operations on entities.
  - POST the operations list (a batch request) to:
      http://myaccount.table.core.windows.net/$batch

Messaging Services

Messaging Services: Why?
- Participants can be weakly connected:
  - No network connection.
  - No simultaneous execution.
  - No binary compatibility.
- Very useful for:
  - Connecting heterogeneous/legacy systems.
  - Workflow systems.
- Processes can be manipulated by adding/removing/replicating messages.
- Examples:
  - Amazon Simple Queue Service (SQS).
  - Microsoft Azure Queues.

Amazon SQS: Simple Queue Service
- Communication service:
  - Reliable.
  - Persistent (1 hour to 2 weeks; default: 4 days).
- A message is a block of text with up to 8 KB.
- Queues store messages until they are delivered.
- A queue stores related messages and can be configured with specific delivery and access control options.

SQS: Consistency
- Message queues are replicated for fault tolerance and scalability:
  - When the queue is read, only a subset of the replicas is contacted, and therefore not all messages may be returned.
- Message delivery is triggered by the receiver.
  - Therefore no delivery times are guaranteed.
- SQS does not enforce/guarantee message ordering.
- The delivery semantics is at-least-once.

SQS: Programming Guidelines
- Use an idempotent message protocol:
  - I.e. don't design operations that presuppose a particular application state, e.g.:
    - Choose SetValue(v,i) and not IncrementValue(v,i).
    - Choose NewPosition(x,y) and not MoveForward().
- Don't use SQS for applications with timing restrictions, e.g. transactional systems.
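
The guideline about idempotent messages can be made concrete with a short sketch. With at-least-once delivery a message may arrive twice; the code below applies a duplicated delivery with a SetValue-style handler and with an IncrementValue-style handler. The message layout and the function names are assumptions for illustration.

```python
def apply_set_value(state, message):
    """Idempotent: applying the same message twice leaves the same state."""
    state[message["key"]] = message["value"]


def apply_increment(state, message):
    """Not idempotent: every duplicate delivery changes the state again."""
    state[message["key"]] = state.get(message["key"], 0) + message["amount"]


def deliver(handler, messages):
    """Apply a sequence of delivered messages to a fresh state."""
    state = {}
    for message in messages:
        handler(state, message)
    return state


set_message = {"key": "v", "value": 5}
inc_message = {"key": "v", "amount": 5}

# The same message delivered once and delivered twice (a duplicate).
set_once = deliver(apply_set_value, [set_message])
set_twice = deliver(apply_set_value, [set_message, set_message])
inc_once = deliver(apply_increment, [inc_message])
inc_twice = deliver(apply_increment, [inc_message, inc_message])
```

The SetValue results agree regardless of duplicates, while the IncrementValue results differ, which is exactly the failure the guideline warns about.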

SQS: Operations
- CreateQueue
- ListQueues
- DeleteQueue: by default, only deletes empty queues.
- SendMessage
- ReceiveMessage: does not remove messages. Makes them invisible.
- PeekMessage: reads a message without changing the queue.
- DeleteMessage: removes the message from the queue.

SQS: Java Example

    // Using the Queue Java library
    import java.util.List;

    public class SampleDriver {
        static public String accessKeyId = "";
        static public String secretAccessKey = "";
        static public String QueueServiceURL = "http://queue.amazonaws.com/";
        static private String queueName = "SQS-Test-Queue-Java";
        static private String testMessage = "This is a test message.";

        public static void main(String[] args) throws Exception {
            // Create the queue and check that it is listed
            Queue testQueue = Queue.createQueue(queueName);
            List<Queue> queues = Queue.listQueues(queueName);
            for (Queue queue : queues) {
                if (queue.getQueueEndpoint().equals(testQueue.getQueueEndpoint())) {
                    System.out.println("Queue found");
                }
            }

            // Send one message
            String msgId = testQueue.sendMessage(testMessage).getId();
            String qCount = testQueue.getApproximateNumberOfMessages();

            // Poll until the message is received (delivery time is not guaranteed)
            List<QMessage> messages = testQueue.receiveMessage(1);
            while (messages.size() == 0) {
                Thread.sleep(1000); // wait for a second
                messages = testQueue.receiveMessage(1);
            }

            // Receiving only hides the message; delete it explicitly
            QMessage message = messages.get(0);
            testQueue.deleteMessage(message.getReceiptHandle());
        }
    }

Microsoft Azure Queues
- Reliable and persistent messaging service.
- Unlimited queues per account and messages per queue.
- Each message can have up to 8 KB.
- Fault tolerance mechanism similar to Amazon's: when messages are read they become invisible for a period.

Operations on Queues
- One can reference queues and messages, e.g.:
    http://<storageaccount>.queue.core.windows.net/<queueName>
- Operations on queues (REST):
  - List queues: http://myaccount.queue.core.windows.net?comp=list
  - Create, Delete.
- Operations on messages (REST):
  - Put, Get, Peek, Delete.
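
The "read makes the message invisible for a period" mechanism, shared by SQS and Azure Queues, can be modelled in a few lines. This is a toy in-memory sketch with an explicit clock argument so that it is deterministic; the class ToyQueue, its method names and the 30-unit timeout are assumptions, not either service's API.

```python
class ToyQueue:
    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> [body, time at which it is visible again]
        self.next_id = 0

    def send(self, body):
        self.next_id += 1
        self.messages[self.next_id] = [body, 0]
        return self.next_id

    def receive(self, now):
        """Return one visible message and hide it; it is NOT removed."""
        for message_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def delete(self, message_id):
        """Remove a message once it has been processed successfully."""
        self.messages.pop(message_id, None)
```

If a consumer crashes after receive and before delete, the message reappears after the timeout and another consumer gets it. That gives fault tolerance, and it is also the source of the at-least-once delivery semantics.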

Queues in C#: Writing

    StorageAccountInfo account = new StorageAccountInfo(
        baseUri, null, accountName, accountKey);
    QueueStorage service = QueueStorage.Create(account);
    MessageQueue queue = service.GetQueue("messages");

    // Create the queue on first use
    if (!queue.DoesQueueExist()) {
        queue.CreateQueue();
    }

    Message msg = new Message(txtMessage.Text);
    queue.PutMessage(msg);

Queues in C#: Reading

    StorageAccountInfo account = new StorageAccountInfo(
        baseUri, null, accountName, accountKey);
    QueueStorage service = QueueStorage.Create(account);
    MessageQueue queue = service.GetQueue("messages");

    if (queue.DoesQueueExist()) {
        // GetMessage hides the message; it must be deleted after processing
        Message msg = queue.GetMessage();
        if (msg != null) {
            RoleManager.WriteToLog("Information",
                string.Format("message '{0}' processed.", msg.ContentAsString()));
            queue.DeleteMessage(msg);
        }
    }

Relational Database Services

Amazon RDS: Relational Database Service
- AWS relational database.
- Goals:
  - Simplify porting applications.
  - Take advantage of low latency inside the cluster: co-locate apps and DB.
- Based on MySQL.
- Supports automatic backups.
- Supports passive replication in different data centers (Multi-AZ, i.e. multiple Availability Zones) for fault tolerance.
- Supports read replicas for load balancing.
- Accessed using sysadmin command line tools:
  - DB creation returns a DNS name.
  - From that point on, it is a conventional MySQL server.

Variants
- Small DB: 1.7 GB RAM, 1 core
- Large DB: 7.5 GB RAM, 2 cores
- XL DB: 15 GB RAM, 4 cores
- Double XL DB: 34 GB RAM, 4 cores
- Quadruple XL DB: 68 GB RAM, 8 cores
- Disk: from 5 GB to 1 TB
- Transitions between variants happen at scheduled moments, with up to 2 hours of downtime.

SQL Azure
- Relational database service:
  - High availability, automatic maintenance.
- The fabric controller monitors the server load and redistributes the partitions with a higher load.
- [Figure labels: Reporting, Business Analytics, Data Sync.]

SQL Azure
- Based on SQL Server.
- Replication on 3 copies.
- Strong coherence: when a write operation returns, all replicas have been updated.
- Maximum DB size: 10 GB.
- Access using: ODBC, JDBC, ADO.NET, LINQ.

SQL Azure vs. Amazon RDS
- Size: RDS up to 1 TB; SQL Azure 10 GB.
- Specificity: Azure is designed for the cloud; RDS is just a MySQL EC2 instance.
- Configurability: the RDS instance can be configured.
- Compatibility: RDS is full-fledged MySQL; SQL Azure is a subset of T-SQL.
- (Price: different ways and prices to charge for DB, bandwidth and RAM.)

Storage: Overview

                   AWS                          Microsoft Azure   Google / Hadoop
    SQL            RDS                          SQL Azure         X
    Tables         SimpleDB                     Tables            Datastore [BigTable] / HBase
    Objects/Blocks S3                           Blobs             (GFS) / HDFS
    Queues         Simple Queue Service (SQS)   Queues            Task Queues

Storage: Comparison
- There are two general complaints:
  - Performance (latency).
  - Coherence models do not scale.
- The problem of scalability is not solved.
- There are no reliable benchmarks.
  - The market is still in a very dynamic phase.
- Google storage services are not accessible remotely.
  - Although you can create an intermediate service.

Next Time...
- Execution and Programming Models in Cloud Computing