NoSQL DBMS: Concepts, Techniques and Patterns. Iztok Savnik, FAMNIT, 2017

1 NoSQL DBMS: Concepts, Techniques and Patterns. Iztok Savnik, FAMNIT, 2017

2 Contents: Requirements, Consistency model, Partitioning, Storage layout, Query models

3 Requirements: Requirements of the new web applications, ACID vs. BASE, CAP theorem

4 ACID vs. BASE
ACID: Atomicity, Consistency, Isolation, Durability. Characteristics: strong consistency, focus on commit, nested transactions, availability?, conservative (pessimistic), difficult evolution.
BASE: Basically Available, Soft-state, Eventual consistency. Characteristics: weak consistency, availability first, approximate answers OK, aggressive (optimistic), simpler! faster, easier evolution.

5 CAP theorem Eric Brewer, "Towards Robust Distributed Systems", keynote at ACM's PODC 2000. Consistency: the system is in a consistent state after the execution of an operation; after an update by some writer, all readers see that update in the shared data source. Availability: the system is designed and implemented in a way that allows it to continue operation if, e.g., nodes in a cluster crash or some hardware or software parts are down due to upgrades. Partition tolerance: the ability of the system to continue operation in the presence of network partitions. Theorem: you can have at most two of these properties for any shared-data system. Widely adopted today by large web companies.

6 CAP theorem

7 Forfeit Partitions Examples: single-site databases, cluster databases, LDAP, xFS file system. Traits: 2-phase commit, cache validation protocols.

8 Forfeit Availability Examples: distributed databases, distributed locking, majority protocols. Traits: pessimistic locking, make minority partitions unavailable.

9 Forfeit Consistency Examples: Coda, Web caching, DNS. Traits: expirations/leases, conflict resolution, optimistic.

10 Consistency Models A consistency model determines rules for visibility and apparent order of updates: when is the system in a consistent state after the execution of an operation? Example: row X is replicated on nodes M and N; client A writes row X to node N; some period of time t elapses; client B reads row X from node M. Does client B see the write from client A? Consistency is a continuum with tradeoffs. Consistency models: timestamps, optimistic locking, vector clocks, multiversion storage.

11 Strict Consistency All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to. This implies either that all operations for a given row go to the same node (with replication for availability), or that nodes employ some kind of distributed transaction protocol (e.g. 2-Phase Commit or Paxos). CAP theorem: strict consistency can't be achieved at the same time as availability and partition tolerance.

12 Eventual Consistency As t → ∞, readers will see writes. In a steady state, the system is guaranteed to eventually return the last written value. Examples: DNS, or MySQL slave replication (log shipping). Special cases of eventual consistency: read-your-own-writes consistency (the "Sent Mail" box), and causal consistency (if you write Y after reading X, anyone who reads Y sees X). Gmail has RYOW but not causal consistency!

13 Timestamps and Vector Clocks Determining the history of a row: eventual consistency relies on deciding what value a row will eventually converge to. In the case of two writers writing at the same time, this is difficult. Timestamps are one solution, but they rely on synchronized clocks and don't capture causality. Vector clocks are an alternative method of capturing order in a distributed system.

14 Vector Clocks A vector clock is a tuple (t_1, t_2, ..., t_n) of clock values, one from each node. V1 < V2 if: for all i, V1[i] <= V2[i], and for at least one i, V1[i] < V2[i]. V1 < V2 implies a global time ordering of events. When data is written from node i, it sets t_i to its clock value. This allows eventual consistency to resolve the order of writes on multiple replicas.
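To make the comparison rule concrete, a minimal sketch in Python (our illustration, not from any particular system; vector clocks are plain tuples):

```python
def vc_less(v1, v2):
    """True if v1 happened-before v2: every component of v1 is <= the
    corresponding component of v2, and at least one is strictly smaller."""
    assert len(v1) == len(v2)
    return (all(a <= b for a, b in zip(v1, v2))
            and any(a < b for a, b in zip(v1, v2)))

# (1, 2, 1) < (2, 2, 1): node 0 has seen one more event
print(vc_less((1, 2, 1), (2, 2, 1)))                 # True
# (2, 0) and (0, 2) are incomparable: concurrent updates, i.e. a conflict
print(vc_less((2, 0), (0, 2)), vc_less((0, 2), (2, 0)))  # False False
```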

15 Optimistic Concurrency Control Locking is a conservative approach that avoids conflicts, but it has costs: additional code for locking, deadlock checking, and lock contention on frequently used objects. If conflicts are rare, we can run transactions concurrently without locking; instead we check for conflicts before committing a transaction. Optimistic control is faster than locking if there are few conflicts (worse when there are many conflicts).

16 Kung-Robinson model A transaction has three phases: READ: the transaction reads from the database but makes changes to a local copy. VALIDATE: validate the changes. WRITE: make the local changes public. (Diagram: old and new versions of the changed objects, reachable from ROOT.)

17 Validation Check conditions that are sufficient for serializability. Every transaction gets a numerical timestamp; transaction IDs are assigned at the end of the READ phase, just before validation. ReadSet(Ti): the set of objects read by transaction Ti. WriteSet(Ti): the set of objects changed by Ti.
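As a sketch of how the validation phase can use these sets, a simplified backward validation in Python (our illustration of the Kung-Robinson idea; names hypothetical):

```python
def validate(ti_readset, overlapping_committed_writesets):
    """Backward validation for transaction Ti (simplified):
    Ti passes if no transaction Tj that committed while Ti was in its
    READ phase wrote an object that Ti read; otherwise Ti must restart."""
    for tj_writeset in overlapping_committed_writesets:
        if ti_readset & tj_writeset:   # ReadSet(Ti) intersects WriteSet(Tj)
            return False               # conflict: abort and restart Ti
    return True                        # safe to enter the WRITE phase

print(validate({"x", "y"}, [{"z"}]))        # True: disjoint sets
print(validate({"x", "y"}, [{"y", "w"}]))   # False: Ti read a stale 'y'
```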

18 Optimistic Locking A unique counter or clock value is saved for each piece of data. On update, the client provides the counter/clock value of the revision it wants to update. An optimistic locking scheme needs a total order of version numbers: to allow causality reasoning on versions (e.g. which revision is considered the most recent by the node on which an update was issued), a lot of history has to be saved. Downside of this procedure, as the Project Voldemort development team argues: it does not work well in a distributed and dynamic scenario where servers show up and go away often and without prior notice, since such a total order easily gets disrupted.
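At its core this is a compare-and-swap on a per-item revision counter. A minimal sketch of the idea (a hypothetical toy store, not any particular product's API):

```python
class VersionedStore:
    """Toy key-value store with per-item version counters for optimistic locking."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))

    def put(self, key, expected_version, value):
        """The write succeeds only if the caller saw the current revision."""
        current_version, _ = self.data.get(key, (0, None))
        if expected_version != current_version:
            return False  # someone updated in between: re-read and retry
        self.data[key] = (current_version + 1, value)
        return True

store = VersionedStore()
v, _ = store.get("article")
store.put("article", v, "draft 1")          # ok: version 0 -> 1
print(store.put("article", v, "draft 2"))   # False: stale version, conflict detected
```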

19 Optimistic Locking Couchbase optimistic locking: we optimistically speculate that there will be no conflict; before commit we check whether there are conflicts; commit is issued if there are none. Handling a single read/write operation is easier than a complex transaction. Example: let's assume we are building an online Wikipedia-like application using Couchbase Server, where users can update articles and add new ones. Alice uses this application to edit an article on bicycles to correct some information. Alice opens the article and makes her changes, but before she hits save, she gets distracted and walks away from her desk. In the meantime, Joe notices the same error in the bicycle article and wants to correct it. If optimistic locking is used in the application, Joe can edit the article and save his changes. When Alice returns and wants to save her changes, either Alice or the application will have to handle the latest updates before allowing Alice's action to change the document. Optimistic locking takes the optimistic view that data conflicts due to concurrent edits occur rarely, so it's more important to allow concurrent edits.

20 Pessimistic Locking Couchbase pessimistic locking: the application simply locks the object and unlocks it after commit. Example: now let's assume that your business process requires exclusive access to one or more documents or a graph of documents. Referring to the previous example, when Alice is editing the document she does not want any other user to edit the same document. If Joe tries to open the page, he has to wait until Alice has released the lock. With pessimistic locking, the application needs to explicitly get a lock on the document to guarantee exclusive user access. When the user is done accessing the document, the locks can be removed either manually or using a timeout.

21 Multiversion Concurrency Control MVCC is a concurrency control method commonly used by database management systems to provide concurrent access to the database. Isolation is the property that provides guarantees for concurrent accesses to data, and isolation is implemented by means of a concurrency control protocol. MVCC solves the problem by keeping multiple copies of each data item: on update, MVCC creates a newer version of the data item. The isolation level commonly implemented with MVCC is snapshot isolation: all reads made in a transaction see a consistent snapshot of the database, i.e. the last committed values that existed at the time the transaction started. How to remove versions that become obsolete? PostgreSQL adopts this approach with its VACUUM process.
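A toy sketch of version chains, snapshot reads, and obsolete-version cleanup (a hypothetical structure, much simplified compared to what PostgreSQL actually does):

```python
class MVCCStore:
    """Each item keeps a list of (commit_timestamp, value) versions."""
    def __init__(self):
        self.versions = {}  # key -> [(ts, value), ...] in commit order
        self.clock = 0

    def write(self, key, value):
        self.clock += 1  # commit timestamp of this update
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot_read(self, key, snapshot_ts):
        """Return the last value committed at or before snapshot_ts."""
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def vacuum(self, horizon_ts):
        """Drop versions no running transaction can still see (cf. VACUUM),
        keeping the newest version visible at the horizon."""
        for key, vs in self.versions.items():
            older = [v for v in vs if v[0] < horizon_ts]
            newer = [v for v in vs if v[0] >= horizon_ts]
            self.versions[key] = ([older[-1]] if older else []) + newer

s = MVCCStore()
s.write("x", 1); snap = s.clock; s.write("x", 2)
print(s.snapshot_read("x", snap))  # 1: the snapshot ignores the later write
```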

22 Multiversion Concurrency Control Each table cell has a timestamp; timestamps don't necessarily need to correspond to real life. Multiple versions can exist concurrently for a given row. Reads may return the most recent version, the most recent version before time T, etc. (free snapshots). The system may provide optimistic concurrency control with compare-and-swap on timestamps. Used by SQL Anywhere, InterBase, Firebird, Oracle, PostgreSQL, MongoDB and Microsoft SQL Server (2005 and later). Not the same as serializability.

23 Vector Clock Details A vector clock is defined as a tuple (V[0], V[1], ..., V[n]) of clock values from each node. In a distributed scenario, node i maintains such a tuple V_i of clock values, which represents its own state and the state of the other (replica) nodes as it is aware of them at a given time: V_i[0] is the clock value of the first node, V_i[1] the clock value of the second node, ..., V_i[i] its own, ..., V_i[n] the clock value of the last node. Clock values may be real timestamps derived from a node's local clock, version/revision numbers, or some other ordinal values.

24 Vector Clocks Vector clocks are updated according to the following rules. If an internal operation happens at node i, this node increments its clock V_i[i]; internal updates are thus seen immediately by the executing node. If node i sends a message to node k, it first advances its own clock value V_i[i] and attaches the vector clock V_i to the message; thereby it tells the receiving node about its internal state and its view of the other nodes at the time the message is sent. If node i receives a message from node j, it first advances its own clock value V_i[i] and then merges its vector clock with the vector clock V_message attached to the message, component-wise: V_i = max(V_i, V_message). To compare two vector clocks V_i and V_j and derive a partial ordering, the following rule is applied: V_i > V_j if for all k, V_i[k] >= V_j[k], and for at least one l, V_i[l] > V_j[l]. If neither V_i > V_j nor V_i < V_j applies, a conflict caused by concurrent updates has occurred and needs to be resolved, e.g. by a client application.
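The three update rules translate directly into code. A sketch (a hypothetical Node class, our illustration):

```python
class Node:
    def __init__(self, node_id, n_nodes):
        self.i = node_id
        self.vc = [0] * n_nodes            # V_i, this node's vector clock

    def internal_event(self):
        self.vc[self.i] += 1               # rule 1: increment own component

    def send(self):
        self.vc[self.i] += 1               # rule 2: tick, then attach the clock
        return list(self.vc)               # V_message travels with the message

    def receive(self, v_message):
        self.vc[self.i] += 1               # rule 3: tick, then merge component-wise
        self.vc = [max(a, b) for a, b in zip(self.vc, v_message)]

a, b = Node(0, 2), Node(1, 2)
a.internal_event()
b.receive(a.send())
print(a.vc, b.vc)  # [2, 0] [2, 1]: b now knows a's state at send time
```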

25 Vector Clocks

26 Vector Clocks Causal reasoning between updates: clients participate in the vector clock scenario. They keep the vector clock of the last replica node they have talked to and use this vector clock according to the client consistency model that is required. For monotonic read consistency, a client attaches the last vector clock it received to its requests, and the contacted replica node makes sure that the vector clock of its response is greater than the vector clock the client submitted; clients can thus be sure to see only newer versions of a piece of data. Advantages of vector clocks: no dependence on synchronized clocks; no total ordering of revision numbers required for causal reasoning; no need to store and maintain multiple revisions of a piece of data on all nodes.

27 Gossip Gossip is one method to propagate a view of cluster status. Every t seconds, on each node: the node selects some other node to chat with, and reconciles its view of the cluster with its gossip buddy. Each node maintains a timestamp for itself and for the most recent information it has from every other node. Information about cluster state spreads in O(log n) rounds. Scalable and no SPOF, but state is only eventually consistent.
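A minimal sketch of one gossip round (a hypothetical view structure; real systems layer failure detection and anti-entropy on top of this):

```python
import random
from types import SimpleNamespace

def gossip_round(nodes):
    """One round: every node reconciles its cluster view with a random buddy.
    A view maps node_id -> (heartbeat, state); the newer heartbeat wins."""
    for node in nodes:
        buddy = random.choice([n for n in nodes if n is not node])
        for nid in node.view.keys() | buddy.view.keys():
            a = node.view.get(nid, (0, None))
            b = buddy.view.get(nid, (0, None))
            best = a if a[0] >= b[0] else b     # keep the fresher information
            node.view[nid] = best
            buddy.view[nid] = best

nodes = [SimpleNamespace(view={i: (1, "up")}) for i in range(8)]
for _ in range(4):                 # O(log n) rounds spread state with high probability
    gossip_round(nodes)
print(all(len(n.view) == 8 for n in nodes))  # very likely True after a few rounds
```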

28 Gossip (Diagram: 4 rounds of the protocol.)

29 Propagate State via Gossip A gossip protocol operates in either a state transfer or an operation transfer model to handle read and update operations as well as replication among database nodes. State transfer model: data, or deltas of data, are exchanged between clients and servers as well as among servers. Database server nodes maintain vector clocks for their data and also state version trees for conflicting versions, i.e. versions whose vector clocks cannot be brought into a V_A < V_B or V_A > V_B relation. Clients maintain vector clocks for pieces of data they have already requested or updated. Vector clocks are exchanged and processed in the following manner. Query processing: when a client queries for data, it sends its vector clock of the requested data along with the request; the server node responds with the part of its state tree for that piece of data that precedes the vector clock attached to the client request, together with the server's vector clock; the client has to resolve potential version conflicts. Update processing and internode gossiping follow the same vector clock exchange scheme.

30 Propagate State via Gossip Operation transfer model: operations applicable to locally maintained data are communicated among nodes; less bandwidth is consumed by interchanging operations than actual data or deltas of data. It is important to apply the operations in the correct order on each node: a replica node first has to determine the causal relationship between operations (by vector clock comparison) before applying them to its data. In this model a replica node maintains the following vector clocks: V_state, the vector clock corresponding to the last updated state of its data; V_i, a vector clock for itself, in which, compared to V_state, merges with received vector clocks may already have happened; and V_j, the vector clock received in the last gossip message of replica node j (one for each of those replica nodes). Vector clocks are exchanged and processed in read, update, and internode messages. Complex protocols handle the queue of (deferred) operations in the correct (causal) order, reasoned from the vector clocks attached to these operations.

31 Partitioning Data in large-scale systems exceeds the capacity of a single machine and should also be replicated to ensure reliability and to allow scaling measures such as load balancing. Ways of partitioning the data of such a system therefore have to be considered, depending on the size of the system and other factors such as dynamism (e.g. how often and how dynamically storage nodes may join and leave). There are different approaches to this issue: memory caches, clustering, the master(s)/slaves model, sharding, and consistent hashing.

32 Memory Caches Memory caches like memcached (memcached.org) are partitioned, though transient, in-memory databases: they replicate the most frequently requested parts of a database to main memory, rapidly deliver this data to clients, and disburden database servers significantly. A memcached deployment consists of an array of processes with an assigned amount of memory, launched on several machines in a network and made known to an application via configuration. The memcached protocol is implemented in different programming languages and provides client applications with a simple key-/value-store API. It stores objects placed under a key into the cache by hashing that key against the configured memcached instances. If a memcached process does not respond, most API implementations ignore the non-answering node and use the responding nodes instead, which causes an implicit rehashing of cache objects, as part of them gets hashed to a different memcached server after a cache miss.

33 Memory Caches If a formerly non-answering node joins the memcached server array again, keys for part of the data get hashed to it again after a cache miss, and objects now dedicated to that node implicitly leave the memory of the other memcached servers they were hashed to while the node was down. memcached applies an LRU strategy for cache cleaning and allows specifying timeouts for cache objects. Logic half in client, half in server: clients understand how to choose which server to read or write to for an item, and what to do when they cannot contact a server; the servers understand how to store and fetch items, and also manage when to evict or reuse memory. O(1): all commands are implemented to be as fast and lock-friendly as possible, which allows near-deterministic query speeds for all use cases. Queries on slow machines should run in well under 1 ms; high-end servers can serve millions of keys per second in throughput. Other implementations of memory caches are available, e.g. for application servers like JBoss.

34 Database clusters High availability: all components of a system, end to end, provide an uninterrupted service. High availability can be achieved through redundancy or very fast recovery. Clustering of database servers is another approach to partitioning data; it strives for transparency towards clients, who should not notice talking to a cluster of database servers instead of a single server. Clustering scales the persistence layer of a system to a certain degree, though many criticize that clustering features have only been added on top of DBMSs that were not originally designed for distribution. MySQL Cluster: shared-nothing clustering and auto-sharding, designed to provide high availability and high throughput with low latency, while allowing for near-linear scalability.

35 Database clusters Clustering uses multiple servers to work with a shared database system; each server within a cluster is called a node. If the primary node fails, the virtual database service becomes active on one of the secondary nodes. Low disk and network latency is essential. Some clustering systems operate at the operating system level, not the database level, e.g. MS-SQL. MySQL InnoDB Cluster: InnoDB Cluster is a collection of products that work together. A group of MySQL servers can be configured as a cluster using MySQL Shell. The cluster has a single master (the primary), which is the read-write master; multiple secondary servers are replicas of the master. A minimum of three servers is required to create a high-availability cluster. A client application is connected to the primary via MySQL Router.

36 Database clusters MySQL InnoDB: InnoDB is a general-purpose storage engine that balances high reliability and high performance. Key advantages of InnoDB include: its DML operations follow the ACID model, with transactions featuring commit, rollback, and crash-recovery capabilities to protect user data; row-level locking and Oracle-style consistent reads increase multi-user concurrency and performance; data is arranged on disk so as to optimize queries based on primary keys; and, to maintain data integrity, InnoDB supports FOREIGN KEY constraints.

37 Master(s)/Slaves models Each data partition has a single master and multiple slaves. Write operations for all or parts of the data are routed to the master(s), while a number of replica servers (slaves) satisfy read requests. Read/write protocol: if the master replicates to its slaves asynchronously, there are no write lags, but if the master crashes before completing replication to at least one slave, the write operation is lost; alternatively, the master replicates writes synchronously to at least one slave. Read requests can go to any replica if the client can tolerate some degree of data staleness. "Mastership" happens at the virtual node level: there is no single physical node that plays the role of the master, so the write workload is also distributed across different physical nodes. When a physical node crashes, the masters of certain partitions are lost, and the most up-to-date slave is nominated to become the new master.

38 Master(s)/Slaves models The master/slave model works very well in general when the application has a high read/write ratio. It also works very well when updates happen evenly across the key range, so it is the predominant model of data replication. There are two ways the master can propagate updates to the slaves: state transfer and operation transfer. The state transfer model is more robust against message loss; deltas are often used (hash trees of the objects). The multi-master model allows updates to happen at any replica; it is used when a single master/slaves model is unable to spread the workload evenly, but it brings problems in retaining consistency.

39 Master(s)/Slaves models Consistency in master(s) models: strict consistency can be achieved by 2PC, which brings all N replicas to the same state at every update. In the "prepare" phase the coordinator asks every replica to confirm whether it is ready to perform the update; after gathering positive responses from all N replicas, the "commit" phase asks every replica to commit. If any one of the replicas crashes, the update is unsuccessful. Quorum-based 2PC: the coordinator only needs W < N replicas to acknowledge; it writes to all N replicas but only waits for positive acknowledgments from any W. On read we access R replicas (with W + R > N) and return the value with the latest timestamp. If the client can tolerate a more relaxed consistency model, we don't need to use the 2PC commit or a quorum-based protocol.
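A sketch of the quorum read/write rule (toy replicas, no timeouts or failures; W + R > N guarantees the read set intersects the last successful write set, so the newest timestamp is the latest value):

```python
class Replica:
    """Toy replica holding a (timestamp, value) pair per key."""
    def __init__(self):
        self.kv = {}
    def store(self, key, tv):
        self.kv[key] = tv
        return True                     # acknowledgment
    def load(self, key):
        return self.kv.get(key)

def quorum_write(replicas, key, value, ts, W):
    """Send the write to all N replicas; succeed once W acknowledge."""
    acks = sum(1 for r in replicas if r.store(key, (ts, value)))
    return acks >= W

def quorum_read(replicas, key, R):
    """Read from R replicas and return the answer with the latest timestamp."""
    answers = [r.load(key) for r in replicas[:R]]
    return max((a for a in answers if a is not None), default=None)

N, W, R = 3, 2, 2                       # W + R > N
replicas = [Replica() for _ in range(N)]
quorum_write(replicas, "x", "v1", ts=1, W=W)
print(quorum_read(replicas, "x", R))    # (1, 'v1')
```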

40 Sharding Partition the data in such a way that data typically requested and updated together resides on the same node. There are different interpretations of sharding: horizontal partitioning of the data store (not of a single table!), where each shard may have the same schema. Usually some subset of attributes is used for the definition of the partitioning; these attributes form the shard key (sometimes referred to as the partition key). Sharding physically organizes the data so that load and storage volume are evenly distributed among the servers. Data shards may also be replicated for reasons of reliability and load balancing; it may be allowed to write either to a dedicated replica only, or to all replicas maintaining a partition of the data.

41 Sharding When an application stores and retrieves data, sharding logic directs the application to the appropriate shard. This sharding logic can be implemented as part of the data access code in the application, or it can be implemented by the data storage system. Benefits of sharding: you can scale the system out by adding further shards running on additional storage nodes; a system can use off-the-shelf hardware rather than specialized and expensive computers for each storage node; you can reduce contention and improve performance by balancing the workload across shards; and, in the cloud, shards can be located physically close to the users that will access the data. Sharding strategies (Microsoft Azure): three strategies are commonly used when selecting the shard key and deciding how to distribute data across shards.

42 Sharding Lookup strategy: the sharding logic implements a map that routes a request for data to the shard that contains the data, using the shard key; the map records the mapping between virtual shards and physical partitions.

43 Sharding Range strategy: this strategy groups related items together in the same shard and orders them by shard key; the shard keys are sequential. Data is usually held in row key order in the shard. Items that are subject to range queries and need to be grouped together can use a shard key that has the same value for the partition key but a unique value for the row key.

44 Sharding Hash strategy: the sharding logic computes the shard to store an item in based on a hash of one or more attributes of the data. The purpose of this strategy is to reduce the chance of hotspots; it achieves a balance between the size of each shard and the average load that each shard will encounter.
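A sketch of the hash strategy (our illustration; note the stable hash from hashlib rather than Python's built-in hash(), which varies across processes):

```python
import hashlib

def shard_for(shard_key, n_shards):
    """Route an item to a shard from a stable hash of its shard key."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % n_shards

print(shard_for("tenant-42:order-7", 8))   # the same key always lands on the same shard
```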

45 Sharding Advantages and considerations. Lookup: offers more control over the way that shards are configured and used; using virtual shards reduces the impact when rebalancing data, since new physical partitions can be added to even out the workload, and the mapping between a virtual shard and the physical partitions can be modified without affecting application code. Range: easy to implement and works well with range queries, which can often fetch multiple data items from a single shard in a single operation; offers easier data management (for example, users in the same region are in the same shard); but it doesn't provide optimal balancing between shards. Hash: offers a better chance of even data and load distribution; request routing can be accomplished directly by using the hash function, so there's no need to maintain a map; but rebalancing shards is difficult and might not resolve the problem of uneven load.

46 Consistent Hashing The topic is modern: it is motivated by issues in present-day systems (web caching, peer-to-peer systems, database partitioning), yet it's general and flexible enough to be useful for other problems. Web caching was the original motivation for consistent hashing in 1997: store requested pages in a local cache so that next time there is no need to retrieve the page from the remote source. Obvious benefits: less Internet traffic, faster access. The amount of data used to store pages cached for 100 machines can be substantial. Where to store them? One machine? A cluster? Use a hash function to map URLs to caches: the first intuition of a computer scientist. Nice properties of hash functions: a good one behaves like a random function, spreading data out evenly and without noticeable correlation across the possible buckets. Designing good hash functions is not easy.

47 Consistent Hashing Say there are n caches, named {0, 1, 2, ..., n-1}. We store the Web page with URL x at the cache server h(x) mod n, where h(x) is way bigger than n. This solution works well as long as you do not change n: all elements would have to be re-hashed if n changes, and n changes whenever a cache is added or a cache fails. Consistent hashing: what we want is hash-table-type functionality where changing n does not disrupt the hashing, i.e. almost all objects stay assigned to the same cache when n changes. Leading idea: hash server names and URLs with the same hash function; this determines which objects are assigned to which caches.
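The re-hashing cost of the mod-n scheme is easy to demonstrate (hypothetical URLs):

```python
import hashlib

def bucket(url, n):
    """The naive scheme: cache index is h(url) mod n."""
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % n

urls = [f"example.com/page{i}" for i in range(10000)]
moved = sum(1 for u in urls if bucket(u, 100) != bucket(u, 101))
print(moved / len(urls))   # ~0.99: adding one cache relocates almost everything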

48 Consistent Hashing Inserting a new object x into the database: given an object x that hashes to the bucket h(x), and imagining we have already hashed all the cache server names, we scan the buckets to the right of h(x) until we find a bucket h(s) to which the name of some cache s hashes. The expected load on each of the n cache servers is exactly a 1/n fraction of the objects. Suppose we add a new cache server s: which objects have to move? Only the objects now assigned to s. Adding the n-th cache causes only a 1/n fraction of the objects to relocate. This approach to consistent hashing can also be visualized on a circle.

49 Consistent Hashing Standard hash table operations, Lookup and Insert: we need an efficient implementation of the rightward/clockwise scan for the cache server s that minimizes h(s) subject to h(s) >= h(x). We want a data structure for storing the cache names, with the corresponding hash values as keys, that supports a fast Successor operation: given a key, we need the next key and its value. Not a hash table; a heap? How about a binary search tree? Complexity is O(log n). Reducing the variance: the expected load of each cache server is a 1/n fraction of the objects, but the realized load of each cache will vary. An easy way to decrease this variance is to make k virtual copies of each cache s, hashing with k different hash functions to get h_1(s), ..., h_k(s). Each server now has k copies, all of them labeled s. Objects are assigned as before: from h(x), we scan clockwise until we encounter one of the hash values of some cache s. Choosing k ≈ log_2 n is large enough to obtain reasonably balanced loads.
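Putting the pieces together, a sketch of a ring with k virtual copies per server and a binary-search Successor lookup (our illustration; names hypothetical):

```python
import bisect
import hashlib

def h(s):
    """Hash a string to a point on the ring (a large integer)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, k=16):
        self.k = k
        self.ring = []                        # sorted list of (hash, server)
        for s in servers:
            self.add(s)

    def add(self, server):
        for j in range(self.k):               # k virtual copies, all labeled `server`
            bisect.insort(self.ring, (h(f"{server}#{j}"), server))

    def remove(self, server):
        self.ring = [(hv, s) for hv, s in self.ring if s != server]

    def lookup(self, key):
        """Successor search: the smallest ring point >= h(key), wrapping around."""
        i = bisect.bisect_left(self.ring, (h(key),))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["cache0", "cache1", "cache2"])
print(ring.lookup("example.com/index.html"))  # the cache responsible for this URL
ring.add("cache3")                            # only ~1/4 of the keys move
```

The sorted list plus bisect stands in for the binary search tree mentioned above; both give the O(log n) Successor operation.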

50 Consistent Hashing Virtual copies are also useful for dealing with heterogeneous caches that have different capacities: use a number of virtual copies of a cache server proportional to the server's capacity.

51 Literature
Christof Strauch: NoSQL Databases.
Ricky Ho: NoSQL Patterns, November 2009. horicky.blogspot.com/2009/11/nosql-patterns.html
Todd Lipcon: Design Patterns for Distributed Non-Relational Databases, presentation. static.last.fm/johan/nosql /intro_nosql.pdf
David Karger et al.: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, ACM Symposium on Theory of Computing, 1997.
Microsoft Azure: Sharding pattern. docs.microsoft.com/en-us/azure/architecture/patterns/sharding
Tim Roughgarden, Gregory Valiant: CS168: The Modern Algorithmic Toolbox, Lecture #1: Introduction and Consistent Hashing, Stanford, 2017.

52 Literature Couchbase: Optimistic or pessimistic locking, which one should you pick?


More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Example File Systems Using Replication CS 188 Distributed Systems February 10, 2015

Example File Systems Using Replication CS 188 Distributed Systems February 10, 2015 Example File Systems Using Replication CS 188 Distributed Systems February 10, 2015 Page 1 Example Replicated File Systems NFS Coda Ficus Page 2 NFS Originally NFS did not have any replication capability

More information

Distributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions

Distributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions Distributed Systems Day 13: Distributed Transaction To Be or Not to Be Distributed.. Transactions Summary Background on Transactions ACID Semantics Distribute Transactions Terminology: Transaction manager,,

More information

Exam 2 Review. October 29, Paul Krzyzanowski 1

Exam 2 Review. October 29, Paul Krzyzanowski 1 Exam 2 Review October 29, 2015 2013 Paul Krzyzanowski 1 Question 1 Why did Dropbox add notification servers to their architecture? To avoid the overhead of clients polling the servers periodically to check

More information

Concurrency Control. [R&G] Chapter 17 CS432 1

Concurrency Control. [R&G] Chapter 17 CS432 1 Concurrency Control [R&G] Chapter 17 CS432 1 Conflict Serializable Schedules Two schedules are conflict equivalent if: Involve the same actions of the same transactions Every pair of conflicting actions

More information

A can be implemented as a separate process to which transactions send lock and unlock requests The lock manager replies to a lock request by sending a lock grant messages (or a message asking the transaction

More information

HOW CLUSTRIXDB RDBMS SCALES WRITES & READS

HOW CLUSTRIXDB RDBMS SCALES WRITES & READS Scaling out a SQL RDBMS while maintaining ACID guarantees in realtime is a very large challenge. Most scaling DBMS solutions relinquish one or many realtime transactionality requirements. ClustrixDB achieves

More information

Distributed Data Analytics Partitioning

Distributed Data Analytics Partitioning G-3.1.09, Campus III Hasso Plattner Institut Different mechanisms but usually used together Distributing Data Replication vs. Replication Store copies of the same data on several nodes Introduces redundancy

More information

References. NoSQL Motivation. Why NoSQL as the Solution? NoSQL Key Feature Decisions. CSE 444: Database Internals

References. NoSQL Motivation. Why NoSQL as the Solution? NoSQL Key Feature Decisions. CSE 444: Database Internals References SE 444: atabase Internals Scalable SQL and NoSQL ata Stores, Rick attell, SIGMO Record, ecember 2010 (Vol. 39, No. 4) ynamo: mazon s Highly vailable Key-value Store. y Giuseppe eandia et. al.

More information