}w!"#$%&'()+,-./012345<ya

Size: px
Start display at page:

Download "}w!"#$%&'()+,-./012345<ya"

Transcription

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Comparison of Java Frameworks for Distributed Application Development

BACHELOR THESIS

Šimon Hochla

Brno, Spring 2015

Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during its elaboration are properly cited and listed in complete reference to the due source.

Šimon Hochla

Advisor: RNDr. Filip Nguyen

Acknowledgement

I would like to express my gratitude to my supervisor RNDr. Filip Nguyen for his support, suggestions and motivation during the elaboration of this thesis.

Abstract

The aim of this work is to compare and assess some of the recent frameworks for the development of distributed applications in Java. The chosen frameworks are Apache Zookeeper, JBoss Infinispan, Akka, Netty and JPaxos. The work compares the technologies from both the practical and the theoretical perspective. It discusses the consistency, availability and resilience guarantees they provide, their programming interfaces, deployment process and intended usage. In the practical part the individual technologies are demonstrated by implementing several agreement and coordination algorithms for a distributed environment.

Keywords

Apache Zookeeper, Netty, JBoss Infinispan, JPaxos, Akka, Java, CAP theorem, distributed systems, two phase commit, three phase commit, leader election, distributed locking

Contents

1 Introduction
2 Frameworks overview
  2.1 Applicability, intended usage and software requirements
    2.1.1 Apache Zookeeper
    2.1.2 Netty
    2.1.3 JBoss Infinispan
    2.1.4 Akka
    2.1.5 JPaxos
  2.2 Algorithms the technologies are based on
    2.2.1 Apache Zookeeper
    2.2.2 Netty
    2.2.3 JBoss Infinispan
    2.2.4 Akka
    2.2.5 JPaxos
3 Resilience, consistency and availability
  3.1 CAP theorem
  3.2 Apache Zookeeper
  3.3 Netty
  3.4 JBoss Infinispan
  3.5 Akka
  3.6 JPaxos
4 Coordination and agreement in distributed systems
  4.1 Two phase commit
  4.2 Three phase commit
  4.3 Leader election
  4.4 Distributed locking
5 Implementation of the algorithms
  5.1 File locking implementation
  5.2 Two phase commit
    5.2.1 Apache Zookeeper
    5.2.2 JBoss Infinispan
    5.2.3 Akka
    5.2.4 Netty
  5.3 Three phase commit
  5.4 Leader election
    5.4.1 Apache Zookeeper
    5.4.2 Netty
    5.4.3 JBoss Infinispan
    5.4.4 Akka
  5.5 Distributed locking
  5.6 JPaxos
  5.7 LOC and cyclomatic complexity analysis
6 Deployment on a cluster and test results
  6.1 Deployment details
    6.1.1 Apache Zookeeper
    6.1.2 JBoss Infinispan
    6.1.3 Netty
    6.1.4 Akka
  6.2 Performance results and their brief reasoning
7 Subjective comparison of the technologies
  7.1 Apache Zookeeper
  7.2 JBoss Infinispan
  7.3 Netty
  7.4 Akka
8 Conclusion

1 Introduction

The area of distributed systems is known for being associated with many pitfalls. Besides the deep theoretical knowledge required from programmers, the development of distributed applications is error prone and can become very time consuming. This situation was an incentive for the creation of tools that help programmers focus on the concrete problems being solved instead of dealing with the common tasks of distributed computing. Such tools facilitate development by providing an infrastructure for the runtime of a distributed system and a programming interface that is often very specific to the given technology. Recent technologies of this kind with bindings to Java include JBoss Infinispan, Apache Zookeeper, JPaxos and Akka. An alternative with a broader scope of applicability is the Netty framework. It enables the development of network protocols and represents the technologies without embedded support for distributed processing.

The aim of this work is to compare and assess the above mentioned frameworks from both the theoretical aspect and the practical applicability to implementing distributed algorithms. The theoretical part analyses them from the point of view of consistency, availability and partition tolerance, and discusses their intended usage and the algorithms the technologies are built on. The practical part demonstrates the specific approaches of the given technologies by implementing the following distributed algorithms: the two phase commit and three phase commit protocols applied to a distributed agreement on writing to a file, leader election, and distributed locking for providing unique access to a file. The algorithms will be deployed and tested on a cluster consisting of several physical machines. The work then compares the performance results measured on the cluster and analyses the implementations using the lines of code and cyclomatic complexity metrics.

The second chapter gives an overview of the technologies with descriptions of their usage, programming interface and the algorithms they rely on. In the third chapter they are analysed from the point of view of the provided resilience, consistency and availability guarantees.

The fourth and fifth chapters are devoted to the specification of the selected algorithms and their implementations using the particular technologies. The sixth chapter contains the deployment details and the results of the testing. The last chapter summarizes the subjective opinions on the technologies acquired during the implementation process.

2 Frameworks overview

2.1 Applicability, intended usage and software requirements

2.1.1 Apache Zookeeper

Apache Zookeeper is an open source project of the Apache Software Foundation. Its purpose is to facilitate the development of distributed algorithms. A simple node architecture resembling a file system enables solving tasks that require coordination. [JR13, 3] The typical use cases of Zookeeper are naming services (converting a name into a physical address on a network), configuration management (joining servers bootstrap their configuration without any action on the side of the centralized source), synchronization (implementation of primitives like locks, barriers and queues), leader election, message queuing (applications communicating over a network can send messages to queues and read from them) and notification systems. Zookeeper ships with a C and a Java API, the core library is written in Java, and the target production platforms are GNU/Linux, Sun Solaris and FreeBSD. Win32/Win64 and MacOSX can be used only for development.

2.1.2 Netty

Netty is a framework providing tools for the rapid development of protocols for servers and clients using an asynchronous, event driven architecture. The technology was originally developed by JBoss, but is now supplied as an open source product of the Netty Project Community. Netty is applicable wherever new protocols need to be designed. General purpose protocols are often used in places where a more narrowly focused version of the protocol would be preferable; another case is when a legacy protocol cannot be used in a new environment. Netty is widely adopted by many organisations, including almost all of the world's biggest providers of software services.

The framework provides bindings to Java and can be used on both the Windows and the Linux platform.

2.1.3 JBoss Infinispan

Infinispan is a data grid platform exposing an interface for storing data in the form of key-value pairs. It guarantees high availability of the service and provides great scalability of the structure. The technology is developed by Red Hat, written in Java and provides bindings to Java. JBoss Infinispan is most commonly applied as a distributed cache in front of NoSQL databases, but it can also be used as a NoSQL store itself or simply to add clustering capabilities to other frameworks.

2.1.4 Akka

Akka is a toolkit and runtime for creating distributed, message driven applications running on the JVM. It is developed by Typesafe Inc. and distributed as an open source project. Above other programming models it provides actor-based concurrency. Akka is a robust technology offering a wide range of services for specific scenarios. It is used in large systems built around transaction processing, mostly in the gaming, finance and trading industries, in tasks requiring parallelism and in many other places. Akka is a cross-platform framework and provides language bindings for Java and Scala; the core is written in Scala. The first release with support for Java comes from

2.1.5 JPaxos

JPaxos is a project developed in a university environment providing facilities for efficient state machine replication, a method for implementing a fault-tolerant service by replicating the service over a set of machines called replicas. It enables the development of highly crash tolerant applications that tolerate message loss and communication delays.

JPaxos can be used as an experimental platform for research into service replication or in commercial products respecting the LGPL 3.0 licence. The design is based on solid theoretical foundations and represents the outcome of the currently conducted communication research. It is bundled as a Java library and the latest release comes from January of

2.2 Algorithms the technologies are based on

2.2.1 Apache Zookeeper

There is a single way of communication between the participating servers, and that is through a shared hierarchical namespace similar to a file system. This hierarchy consists of nodes called ZNodes with a parent node common to all ZNodes. The nodes are identified by a path constructed from their respective parent nodes separated by a slash (/), with the root node at the beginning. Each ZNode except for the root has exactly one parent node and optionally its own set of children. Furthermore, there is a small data structure associated with every ZNode, which can be used for storing any data up to 1 MB, but preferably much smaller. There are two types of ZNodes: persistent and ephemeral. Ephemeral nodes cease to exist after the client session is terminated; persistent nodes remain in the system until they are deleted. When a node is created it must be assigned a unique path. This can be achieved by the functionality called sequence nodes, which appends a monotonically increasing number to the end of the path if several nodes with the same name are created under a common parent. [Reeb]

The framework provides an event system called watches, which can be utilized together with ephemeral and sequential nodes for notification about a new client connection or disconnection. A watch is assigned to a specific node and can be triggered by its creation, deletion, modification or a change in its children.
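To make the ZNode and watch concepts concrete, a minimal sketch using the standard org.apache.zookeeper.ZooKeeper client might look as follows (the connection string and paths are illustrative and not taken from the thesis code):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper server; the watcher lambda receives session events.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

            // Create a persistent parent node if it does not exist yet.
            if (zk.exists("/demo", false) == null) {
                zk.create("/demo", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // An ephemeral sequential child: it disappears with the session and
            // gets a monotonically increasing suffix appended to its name.
            String member = zk.create("/demo/member-", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // A one-shot watch on the children of /demo, triggered by the next change.
            zk.getChildren("/demo", event -> System.out.println("children changed: " + event));

            System.out.println("created " + member);
            zk.close();
        }
    }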

ZooKeeper is a service replicated over a set of servers running in a distributed environment. These servers keep an in-memory copy of the data tree along with transaction logs and snapshots stored in persistent memory. Clients always connect to only a single ZooKeeper server. The client establishes a TCP connection with the given server and sends heartbeats informing about its availability. When the server fails, the client reconnects to a different one. The Zookeeper service is available as long as a strict majority of the servers is running. There is one master node, which is dynamically elected and through which all write operations are performed. The master guarantees that after a successful write operation a majority of the nodes will hold the recent value. Write operations in Zookeeper are expensive compared to read operations, because they require all the servers to perform the update, while a read needs only a single server. Zookeeper is therefore meant to be used in an environment with a very large number of nodes, mostly read operations and no large quantities of data being sent.

2.2.2 Netty

The main idea behind Netty is its asynchronous architecture. It uses operations that are strictly non-blocking but share the same thread at the same time. This means that every method returns immediately rather than waiting for the result. What is returned instead is a ChannelFuture, which can be listened on to find out whether the operation succeeded, failed, or was cancelled. The communication unit sent between servers is the ChannelBuffer, an interface providing an abstract view of one or more primitive byte arrays. Netty includes standard conversions for many types, including String, HTTP and others, but users who want their own types must provide both an encoder and a decoder. The Channel component is used as the abstract representation of a socket capable of operations such as read, write, connect and bind. Above all, it provides the information whether it is in the connected or disconnected state and keeps a ChannelPipeline holding the channel handlers assigned to the channel. The main business logic is usually implemented in a class extending ChannelHandler. It contains methods which the user can override as needed. One of them is channelActive(ChannelHandlerContext ctx), triggered when a new connection with another channel is established.

ChannelHandlerContext holds a reference to that channel. The next one is channelRead(ChannelHandlerContext ctx, Object msg), which is triggered when some data is sent to the channel. The message can be answered by calling the write method on the channel it was received from. The ChannelPipeline passes the raw data to the first handler and returns it in its final form from the last one; this is also where the encoders and decoders are engaged. A Bootstrap instance starts and stops Netty applications. ServerBootstrap uses the method bind(), while Bootstrap uses the method connect().

2.2.3 JBoss Infinispan

JBoss Infinispan is based on cache objects providing a map-like interface extending java.util.Map. These caches can be used for storing key-value pairs of any Object types. The caches can be persisted across the cluster in four ways. Local cache is the mode where the cache is stored only locally in the given node. The next one is invalidation mode, where all entries are saved in a cache store such as a database; when a node wants to read a value, it loads it from the cache store holding the invalidated values. The third one is replicated mode, where the caches are uniformly replicated over all of the nodes in the cluster. Finally, there is distributed mode, where only a subset of all nodes stores the given cache. This can be useful when a certain degree of fault tolerance is wanted while retaining scalability.

To get access to the caches, an instance of CacheManager must be created and passed an XML file providing the configuration of the given caches. On this object it is possible to call the get method with the name of the requested cache. The cache can then be used as an ordinary map object, or a listener can be set on it watching the state of the cache. This listener can have several methods triggered depending on the type of change in the cache, above all the creation of a new entry, the removal of an entry or its modification. Next, the user has the option to set the listener not on the cache but on the CacheManager. There it is possible to watch changes of the node structure, triggering an event when either a new node appears in the system or one of the nodes disconnects. This applies to nodes, possibly located on different machines in the cluster, that use one of the caches contained in the configuration file.
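A minimal sketch of this API (the configuration file name, cache name and listener are illustrative, not the thesis code) could look like this:

    import org.infinispan.Cache;
    import org.infinispan.manager.DefaultCacheManager;
    import org.infinispan.notifications.Listener;
    import org.infinispan.notifications.cachelistener.annotation.CacheEntryCreated;
    import org.infinispan.notifications.cachelistener.event.CacheEntryCreatedEvent;

    public class CacheSketch {

        // Listener invoked by Infinispan whenever a new entry appears in the cache.
        @Listener
        public static class CreationListener {
            @CacheEntryCreated
            public void entryCreated(CacheEntryCreatedEvent<String, String> event) {
                System.out.println("created: " + event.getKey());
            }
        }

        public static void main(String[] args) throws Exception {
            // The XML file is assumed to define a cache named "demoCache".
            DefaultCacheManager manager = new DefaultCacheManager("infinispan.xml");
            Cache<String, String> cache = manager.getCache("demoCache");

            cache.addListener(new CreationListener());
            cache.put("vote-node1", "commit");   // used like an ordinary map
            System.out.println(cache.get("vote-node1"));

            manager.stop();
        }
    }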

2.2.4 Akka

The Akka architecture is based on the Actor model defined in 1973 by Carl Hewitt and later popularized by the Erlang language. Actors can be perceived as a higher level abstraction over objects. They are similar to the abstraction of a person in the sense that every actor has its own place in a hierarchy and is responsible for the actors situated below it. Actors encapsulate their state and behaviour and communicate with their environment only through a single request-response method.

Actors are implemented by extending the UntypedActor class, which defines one abstract method, onReceive(Object message). This method determines the behaviour of the actor. The received message is of the Object type, which enables sending data of any kind; the original type can be recovered with an instanceof check on the received object. Further methods accessible from the context of onReceive() are getSender(), returning a reference to the sender of type ActorRef, and getSelf(), returning the reference to the given actor. The actor reference provides a method tell(Object message, ActorRef sender) for sending a message to the actor it is called on, with the reference to the sender in the parameter. Another important part of the Akka functionality is hidden in the method getContext(), also accessible only from the onReceive() context. It exposes facilities for creating new child actors and for getting the system the actor belongs to, its parent supervisor and its supervised children. The other methods that can be overridden on the UntypedActor are preStart(), preRestart(), postRestart() and postStop(), for managing the actor in the different stages of its lifecycle. Lifecycle monitoring is the way to watch an actor for its termination; there is a method watch(ActorRef targetActorRef) accessible from the ActorContext referencing the watched actor.
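A minimal sketch of this actor API (actor and message names are illustrative, written against the classic UntypedActor-based Java API described above):

    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.actor.UntypedActor;

    public class EchoSketch {

        // An actor that replies to every String message it receives.
        public static class EchoActor extends UntypedActor {
            @Override
            public void onReceive(Object message) {
                if (message instanceof String) {
                    getSender().tell("echo: " + message, getSelf());
                } else {
                    unhandled(message);
                }
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("demo-system");
            ActorRef echo = system.actorOf(Props.create(EchoActor.class), "echo");

            // Fire-and-forget send; ActorRef.noSender() means no reply address.
            echo.tell("hello", ActorRef.noSender());
        }
    }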

The last two functionalities, which cannot be omitted, are routing and remoting. Messages can be sent to a router, which is an actor that directs the messages to destination actors. There are many ways to route the messages; Akka contains routers providing support for load balancing, a round robin mechanism, scatter-gather or a simple broadcast. Remoting is the way to either look up or create actors in a different ActorSystem, possibly running on a separate machine. This can be achieved by changing the LocalActorRefProvider to the RemoteActorRefProvider and assigning the specified paths to be created on a remote host.

2.2.5 JPaxos

JPaxos implements the Paxos distributed algorithm. It adheres to the basic principles of the protocol and introduces several optimizations, utilizing techniques like batching for the performance improvements required by modern systems. The Paxos algorithm is used for reaching agreement on a value among a group of nodes. It guarantees that after one node sees a value that a majority of the nodes sees as well, the majority will never decide on a different value. Any change proposed by a node must therefore first be agreed on by a majority of the nodes. It requires a strict ordering of messages from the nodes; all changes are assured to be applied in the same order on all the nodes in the system.

JPaxos provides three types of instances the end user works with. The Service interface is where the user-provided functionality is defined; it specifies methods for the interaction with the JPaxos system. Next is the Replica class, which when deployed is bound to the implementation of the Service. The last one is the Client class, which is used for sending the client's requests to the service. When a Replica instance is created, it is passed a Service instance, a configuration class and a unique id for identifying the replica among the other replicas. Usually several replicas doing the same job are started in the system, which only adds to its robustness. JPaxos provides several types of Service interfaces with different degrees of abstraction. The most common ones are the SimplifiedService, operating with byte arrays, and the SerializableService, which performs serialization automatically.

Above all, they define a method for the execution of the request-response communication that accepts the client data and returns a response back to the client. The next two abstract methods the user needs to implement are for creating a snapshot and updating to a certain version of a snapshot. A snapshot consists of the data the service persists between the individual executions. The Client requires upon creation a configuration file with the specification of the system containing the host addresses of the replicas. The Client provides two methods, for connecting to the system and for executing requests. It connects to the replicas dynamically: when the one it is connected to crashes, another one is used instead.

3 Resilience, consistency and availability

3.1 CAP theorem

The CAP theorem states that it is impossible for a distributed computer system to provide all three of the following guarantees at the same time: Consistency, Availability and Partition tolerance. [SG]

Consistency: there must be a total order on all operations performed in the system; every node sees exactly the same state of the requested resource.

Availability: for a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response. [Lam86] This means that any finite algorithm on a distributed system must eventually terminate. There is no restriction on how long it will take, only that it happens in finite time.

Partition tolerance: the system continues to work after losing an arbitrary number of messages. Any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost.

There has been a lot of debate about which combination of these three guarantees is the most preferable. The conclusion is that in systems built upon real world networks it is almost impossible to find a case where no messages will ever be lost and no nodes will fail. Eventually, one ends up deciding between the remaining two, Availability and Consistency. Either way is a certain tradeoff with its consequences. When consistency is chosen over availability, it is necessary to cope with issues concerning the perceived liveness of the system from the point of view of the client. Any writes arriving while the system is unavailable must be buffered so that the operation can proceed after the temporary partition is resolved; this can itself lead to inconsistency when the buffer fails and the writes are lost. The other strategy is to inform the client with an error message and ask them to wait until the system is available again.

When availability is chosen over consistency, it can happen that multiple readers get different results for the same property. This is because every reader receives only the most recent value that the node it reads from can provide, not the last value written in the whole system. This requires the application programmers developing their services on top of the system to cope with these inconsistencies and try to provide eventual consistency (a consistency model guaranteeing that, if no new updates are made to a given data item, eventually all accesses to that item return the last updated value). It is up to the programmers to update the value to the most current state after a divergence has been detected, which requires history tracking and update merging.

3.2 Apache Zookeeper

Apache Zookeeper provides the following consistency guarantees:

atomicity: updates either succeed or fail
sequential consistency: updates are applied in order
reliability: updates persist once applied
single system image: a client sees the same view of the service regardless of the ZK server it connects to
timeliness: the client's view of the system is guaranteed to be up-to-date within a certain time bound (eventual consistency) [Reea]

Zookeeper tolerates partitions with up to n failed server nodes in an ensemble originally consisting of at least 2n+1 servers; the remaining n+1 servers are called a quorum. Two situations can happen when a partition occurs: the current leader either falls into the part holding the quorum and the system continues normally, or it falls into the non-quorum part and a new leader must be elected. After the non-quorum part rejoins, it updates the values created during the time of its disconnection. [fai]

3.3 Netty

Netty represents a technology which does not provide any support for resuming after a server or client disconnection; it is up to the developer to implement this behaviour. When a channel is closed, meaning that a client that was previously connected to the server has terminated, the method channelInactive() is triggered. As Netty is purely asynchronous, all operations are performed without waiting for the return value. What can be utilized to get information about success or failure is the ChannelFuture object returned by every operation. It is a handle to the event related to the operation that returned it. A listener can be set on the ChannelFuture to get informed about the result of the operation. The fact that an operation completed does not mean that it succeeded; in fact there are four possible results: success, failure, timeout and cancellation.
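Such a listener can be attached as in the following fragment (a sketch; the class name, write call and message are illustrative):

    import io.netty.channel.Channel;
    import io.netty.channel.ChannelFuture;
    import io.netty.channel.ChannelFutureListener;

    public class FutureSketch {

        // Writes a message and reports the outcome once the asynchronous operation finishes.
        static void writeAndReport(Channel channel, Object message) {
            ChannelFuture future = channel.writeAndFlush(message);
            future.addListener(new ChannelFutureListener() {
                @Override
                public void operationComplete(ChannelFuture f) {
                    if (f.isSuccess()) {
                        System.out.println("write succeeded");
                    } else if (f.isCancelled()) {
                        System.out.println("write cancelled");
                    } else {
                        System.out.println("write failed: " + f.cause());
                    }
                }
            });
        }
    }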

3.4 JBoss Infinispan

Infinispan is described as leaning towards providing Consistency and Availability while sacrificing Partition tolerance. However, Infinispan offers the option to enable or disable partition handling. With partition handling disabled, a partitioned node is assumed to have crashed and must be restarted to rejoin the cluster. It starts again with a different JGroups address and without holding any data. When the cluster is separated into two partitions with unsynchronized updates, then after joining together the partition with the larger number of nodes overwrites the data held by the smaller partition. In the other case, a node merges back into the cluster after a disconnection without a restart. During a merge there are five possible scenarios which can occur, and there is a complex theory behind the process of merging in Infinispan. It always tries to resolve the inconsistencies and make the best of the state at hand. The process also depends on the mode the caches are set to, and only eventual consistency can be assumed. What always comes first in Infinispan is the availability of the system.

3.5 Akka

In Akka the availability and partition tolerance guarantees are bundled together. Akka uses vector clocks for the partial ordering of events; on each update to the cluster state the vector clock is updated too. For broadcasting the message with the current state of the system the gossip protocol is used, which sends the information in a randomized fashion with preference for the nodes that have not seen the latest version. Gossip convergence is the process built on this protocol that finishes when all nodes see the same view of the system. Throughout the gossip convergence the system continues to run and only the membership management is affected; any new nodes wanting to join the cluster must wait until the gossip convergence is done. For partition detection Akka uses the Phi Accrual Failure Detector. It determines from the history of a node's behaviour whether the node is actually down or is only temporarily disconnected. The threshold for deciding between the two states is configurable. This can be used to provide eventual consistency. [Akk]

3.6 JPaxos

JPaxos supports two modes of operation. The basic mode is able to tolerate n replica breakdowns when the system is deployed on at least n+1 machines. The extended mode uses non-volatile memory for persisting snapshots, so any number of replicas can become corrupted and the system is still able to recover. The period in which the stored snapshot is refreshed is determined by the selected crash model: it can be persisted on the start of the replica, periodically, or after every state change, the last option being suitable only for highly crash prone systems because of its high performance cost. The Paxos algorithm guarantees strict ordering of the updates, so the state of a replica is kept consistent with the other replicas.

4 Coordination and agreement in distributed systems

This chapter provides the theoretical background for the algorithms implemented in this work. The choice of the algorithms is discussed first, followed by the individual definitions. The first two algorithms are the two phase commit and three phase commit protocols, which are used for distributed agreement on a globally applied change in a distributed system. The next algorithm is leader election, for deciding on a leader in a group of members. The last one is distributed locking, for synchronizing access to shared resources. These algorithms were chosen because of their simplicity and their very long tradition in the area of distributed systems. More sophisticated variants of them exist, but the aim of this work is to demonstrate how the given technologies can provide these functions.

4.1 Two phase commit

The two phase commit (2PC) algorithm is a way to decide on approving or cancelling a new transaction in a group of nodes. There is one coordinator node, which manages the transaction, and several participants that can either agree or disagree with the proposed transaction. The algorithm consists of two phases, a voting phase and a commit phase. In the voting phase the coordinator sends a request to each participant. The participants decide to commit or abort the transaction, log their decision, lock the shared resources in the case of commit, and send their votes back to the coordinator. In the commit phase the coordinator makes the final decision based on the collected votes and sends it to all of the participants. The decision is commit when all participants have voted commit and abort in any other case. Each participant then either proceeds with the transaction, in the case of commit, or rolls it back, when the decision is abort, and sends an acknowledgement back to the coordinator. When the coordinator has collected acknowledgements from all of the participants, the transaction is done. [PAB09]
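As a framework-independent illustration of the protocol just described, the following sketch (the interface and names are hypothetical, not taken from the thesis code) outlines the coordinator side; crashes and timeouts are deliberately ignored:

    import java.util.List;

    public class TwoPhaseCommitCoordinator {

        // Hypothetical view of a participant as seen by the coordinator;
        // in the real implementations these calls travel over the network.
        public interface Participant {
            boolean vote();      // voting phase: true = commit, false = abort
            void commit();       // commit phase: apply the transaction
            void rollback();     // commit phase: undo the transaction
        }

        // Runs one transaction and returns true if it was committed everywhere.
        // The acknowledgement is modelled by the return of commit()/rollback().
        public boolean runTransaction(List<Participant> participants) {
            // Phase 1: collect the votes.
            boolean allCommit = true;
            for (Participant p : participants) {
                if (!p.vote()) {
                    allCommit = false;
                }
            }
            // Phase 2: broadcast the decision.
            for (Participant p : participants) {
                if (allCommit) {
                    p.commit();
                } else {
                    p.rollback();
                }
            }
            return allCommit;
        }
    }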

4.2 Three phase commit

Three phase commit (3PC) is a refinement of the 2PC algorithm that guarantees strong termination when timeouts are used. The system cannot reach a blocking state in which all nodes are waiting for one crashed node to recover. This can happen in 2PC when both the coordinator and one of the participants fail at the same time. The guarantee is provided by adding a third phase between the voting and commit phases, called the precommit phase. It follows after all the participants have agreed to the transaction and before the coordinator sends the decision to commit. [IK]

4.3 Leader election

This is the most platform dependent algorithm, meaning it can be achieved in various ways depending on the technology it is implemented on. There are three rules which the implementation must obey: the algorithm must terminate in finite time, only one node is selected as the leader, and every other node is informed about the new leader after it has been elected. [IG00]

4.4 Distributed locking

Distributed locking is needed whenever there is a shared resource, replicated over a set of separate nodes, which may be accessed by only one of the members at a time. There are basically two methods which need to be implemented: the lock method, which waits until the resource is available and then provides unique access, and the unlock method for releasing the locked resource.
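A minimal Java interface capturing these two operations (purely illustrative, not part of the thesis code) could look like this:

    public interface DistributedLock {

        // Blocks until the shared resource is available and this member holds the lock.
        void lock() throws InterruptedException;

        // Releases the lock so that another member may acquire it.
        void unlock();
    }

The implementations in chapter five effectively provide these two operations on top of the primitives of the individual technologies.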

5 Implementation of the algorithms

This chapter describes the implementation strategies specific to the chosen technologies when applied to the algorithms from chapter four. A description of the file locking mechanism common to all implementations comes first, followed by the individual implementations categorised by the selected algorithms.

5.1 File locking implementation

To demonstrate that synchronized access to shared resources can be achieved using the selected algorithms, the implementation provides a mechanism for distributed file locking. The following assumptions relate to the locking mechanism: no other process should be able to delete or modify the file after a lock has been assigned in the given process, and only this process should have the privilege to perform write operations during that time. The lock is released when the process either terminates or calls the unlock method.

The FileLock API from java.nio.channels provides exactly the above mentioned functionality and is platform independent, so it is used in the file locking implementation located in the class LockFileDemo. There are three methods defined in LockFileDemo: lockFile(), releaseLock() and writeToFile(String data). The first one tries to acquire an exclusive lock for the file specified as a global variable. If the file is already locked by a different JVM, the process exits with an error, because this is an unexpected usage; otherwise it assigns the newly acquired lock to a global variable holding the reference to the lock. WriteToFile() simply appends the given string to the instance of java.io.RandomAccessFile opened by lockFile(). The lock can then be released by the method releaseLock(), which informs the user when the lock has not been defined or is invalid. Furthermore, releaseLock() closes the file reference created by lockFile(), so it is no longer possible to use the method writeToFile().
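The core of such a mechanism, using the standard java.nio.channels.FileLock API, can be sketched as follows (a simplified illustration, not the thesis's LockFileDemo class itself; the file name is illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class FileLockSketch {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile("shared.txt", "rw")) {
                FileChannel channel = file.getChannel();

                // tryLock() returns null if another process already holds the lock
                // (and throws OverlappingFileLockException within the same JVM).
                FileLock lock = channel.tryLock();
                if (lock == null) {
                    System.err.println("file is already locked by another JVM");
                    return;
                }
                try {
                    // Exclusive access: append a line to the end of the file.
                    channel.position(channel.size());
                    channel.write(ByteBuffer.wrap("hello\n".getBytes()));
                } finally {
                    lock.release();   // corresponds to releaseLock()
                }
            }
        }
    }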

5.2 Two phase commit

5.2.1 Apache Zookeeper

Using the Zookeeper node system, the coordinator creates a transaction node under which the participating sites create their respective child nodes with the EPHEMERAL and SEQUENTIAL flags when they join the system. Each site then sets a watch on changes of the coordinator's node data content. The coordinator waits until the expected number of sites has joined and then asks them to give their votes by writing the request to its node's data content. After a participant decides to commit or abort, it sends its vote to the coordinator by writing it to its respective node. The coordinator watches for any changes of its children in a loop and waits until either one of the participants votes abort or all sites have voted commit. Then it writes the final result to the transaction node. After receiving the result, the sites send their acknowledgements back to the coordinator in the same way as they provided their votes. After the coordinator collects all the acknowledgements, the transaction is done.

5.2.2 JBoss Infinispan

Two replicated Infinispan caches are used: the coordinator cache, storing either the coordinator request or the decision and its value, and the sites cache, where the participants' addresses are mapped to string values. First the sites are started. After a site starts, it opens the coordinator and sites caches and sets a listener on changes of the coordinator cache. The coordinator is started no sooner than all sites which want to participate have opened their sessions. The coordinator collects the addresses of the remote cache managers using the sites cache and puts them one by one into the sites cache, mapped to an empty string. Then it requests the sites to give their votes by writing the transaction request to the coordinator cache under the key request, and sets a listener on the sites cache. Every site now finds its respective address in the sites cache key set and writes its transaction decision as the value under its address. The coordinator listener is triggered each time a site gives its vote and waits until all sites have provided their votes.

Then it decides the final result and propagates it to the sites by writing it to the coordinator cache under the key decision. After all the sites have acknowledged receiving the result, in the same way as they voted, the transaction is done.

5.2.3 Akka

The implementation uses two functionalities provided by Akka, remoting and routing. First all the participants are started. They create their actor systems and then wait for the request from the coordinator. The coordinator, after setting up its own actor system, creates the participant nodes for all sites participating in the transaction. Here the remoting is used: all creations are redirected, according to the configuration, to the specific remote actor systems. The coordinator then creates its own actor and in the preStart() method requests the participants for their votes. This is done using the routing functionality. A router actor of the type BroadcastGroup is created in the coordinator's actor system, which redirects all messages to the other actors specified in a list assigned at its creation; this list contains the paths to the participant nodes. The participants and the coordinator then communicate in a request-response fashion using the method onReceive(Object message), from which getSender() is called to track the sender address and send the response. After the coordinator collects the messages with the decisions from the participating sites, it decides the result, sends it back to all sites and waits for their acknowledgements.

5.2.4 Netty

When implementing a new protocol in Netty it is advisable to start from one of the examples in the Netty repository; in this implementation the Telnet example is used as the starting point. There is one server representing the coordinator and several clients for the participants, which connect to the server on startup. When a connection to the server is initiated, the server's method channelActive(ChannelHandlerContext ctx) is triggered. From the ChannelHandlerContext the channel of the client is acquired and put into a variable of the type ChannelGroup collecting all the connected clients. When the size of the group reaches the desired number of participants, the coordinator sends the vote request to each participant. The communication then continues in the form of request-response, using the same principle as in Akka.
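A rough sketch of such a server-side handler (the class name, participant count and vote-request message are illustrative, not the thesis code; a StringDecoder/StringEncoder pair is assumed in the pipeline, as in the Telnet example):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.group.ChannelGroup;
    import io.netty.channel.group.DefaultChannelGroup;
    import io.netty.util.concurrent.GlobalEventExecutor;

    public class CoordinatorHandler extends ChannelInboundHandlerAdapter {

        private static final int EXPECTED_PARTICIPANTS = 3;   // illustrative
        private static final ChannelGroup participants =
                new DefaultChannelGroup(GlobalEventExecutor.INSTANCE);

        @Override
        public void channelActive(ChannelHandlerContext ctx) {
            // Remember the newly connected participant.
            participants.add(ctx.channel());
            // Once everyone has joined, broadcast the vote request.
            if (participants.size() == EXPECTED_PARTICIPANTS) {
                participants.writeAndFlush("VOTE_REQUEST\r\n");
            }
        }

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            // Votes arrive here; a real coordinator would collect and count them.
            System.out.println("from " + ctx.channel().remoteAddress() + ": " + msg);
        }
    }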

5.3 Three phase commit

The three phase commit protocol works in the very same way as the two phase commit protocol, with the only difference that it adds a second commit phase. The transition between the implementations is therefore straightforward. Instead of the commit request sent by the coordinator, two types of requests are used: preCommit and doCommit. To a preCommit the participant responds with an ACK. After having collected the acknowledgements from all the participants, the coordinator sends the final result doCommit. The participant then enters the committed state and sends the second acknowledgement, haveCommitted. The transaction is finished after the coordinator collects these acknowledgements from all participants. However, the power of the three phase commit does not show until timeouts are implemented. Real world implementations are able to manage response expirations to handle crashes of either the coordinator or the participants. In this work only the basic functionality of the protocol is implemented, as timeout management would make for another chapter.

5.4 Leader election

The aim is to solve the problem of selecting one node from a group of nodes, informing the other nodes about which node has been chosen, and repeating the process when the selected node fails.

5.4.1 Apache Zookeeper

Leader election in Apache Zookeeper can be implemented in a very simple manner. Suppose a ZNode called /election is created, under which each election candidate appends a new node with the EPHEMERAL and SEQUENTIAL flags. The name of the node is automatically appended with an index greater than that of the sibling node with the largest index existing before it was created. The process that created the node with the smallest index is the leader. When the leader process terminates, its respective node is deleted automatically because it was created as an EPHEMERAL node. This is the event the other nodes must be watching for. To avoid an unnecessary herd effect, only one node is notified, namely the node with the next smallest index after the node that terminated. This process continues until there are no more nodes under /election (a minimal code sketch of this recipe follows after the Netty variant below).

5.4.2 Netty

Here the same architecture is used as in the two phase commit, with one server as the coordinator and several clients representing the election candidates. When a client joins the coordinator, its respective Channel is added to the pool collecting all connected clients. After the given number of candidates have connected to the server, the method electLeader() is triggered on the coordinator, which determines the leader and then sends the election result to all the participants. The election method is based on choosing the client communicating on the channel with the minimal id. When a member gets the message that it has been selected as the leader, it starts the leader procedure and terminates after it has finished. This event is watched by a listener implementing ChannelFutureListener located in the coordinator. The leader is removed from the pool of candidates and a new leader is elected. This repeats until the pool is empty.
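As mentioned in 5.4.1, the ZooKeeper election recipe essentially boils down to the following sketch (paths and names are illustrative; error handling and re-registration of the predecessor watch are omitted):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class ElectionSketch {

        // Joins the election under an existing /election node and
        // returns true if this candidate is currently the leader.
        public static boolean joinElection(ZooKeeper zk) throws Exception {
            String myPath = zk.create("/election/candidate-", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String myName = myPath.substring("/election/".length());

            // The candidate whose node has the smallest sequence number leads.
            List<String> children = zk.getChildren("/election", false);
            Collections.sort(children);
            boolean leader = children.get(0).equals(myName);

            if (!leader) {
                // Watch only the predecessor node to avoid the herd effect.
                int myIndex = children.indexOf(myName);
                String predecessor = "/election/" + children.get(myIndex - 1);
                zk.exists(predecessor, event -> System.out.println("predecessor gone: " + event));
            }
            return leader;
        }
    }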

5.4.3 JBoss Infinispan

In this implementation two caches are used: electableMembersCache and leaderCache. ElectableMembersCache persists keys of the Long type, denoting the time when the given member applied for the leadership, mapped to the addresses of the nodes. The addresses are acquired in the same fashion as in the two phase commit; they represent the address of the remote EmbeddedCacheManager of the respective node. There is one public method, becomeElectable(), called on startup for joining the electable members. First the pair of the current time and the address of the node entering the group of electables is put into the electableMembersCache. Subsequently, the minimal value from the key set of electableMembersCache is acquired to determine the node that joined before any other node. This minimal index is compared to the key under which the current node has been saved, and if they are equal the node becomes the leader and performs the leader procedure. Otherwise it sets a listener on the EmbeddedCacheManager for watching changes of the node topology. The event triggering the listener holds two values: the set of the old members before the listener was notified and the set of the new members after the topology change. To get the members that disconnected in the current change, the new members need to be subtracted from the old ones. For every node that has disconnected, its respective key-value pair then needs to be deleted from the electableMembersCache. At the end of the listener the new leader election is performed, in the sense that the member with the currently minimal index becomes the leader. This repeats until the electableMembersCache is empty.

5.4.4 Akka

Leader election in Akka uses the Akka Cluster module, which needs to be added as a project dependency. The actor representing the electable member listens in the method onReceive(Object message) for cluster changes by waiting for three types of messages: CurrentClusterState, sent to the subscriber when it initiates the session, MemberUp, when a new member joins the group of electables, and MemberRemoved, when a member disconnects.

After receiving the CurrentClusterState, the members from the current state are added to a sorted set using an age comparator. Every member of the type Member has the built-in function isOlderThan(Member m), which can be utilized in the comparator. The method currentMaster() then returns the first member from the set of members, which is the leader. This implementation has been adopted from the Akka samples because it perfectly demonstrates the built-in infrastructure for membership handling.

5.5 Distributed locking

The locking part of the algorithm can be achieved in the very same way as the process of leader election; the differences appear only in the behaviour after the member has been selected. The method for becoming electable simply translates to the method for acquiring the lock. Instead of performing the leader procedure, the selected member informs all the other members that it has acquired the lock. The member can then perform a method requiring exclusive access to some resources. After it is done, it calls the unlock method, in which it informs the other members that it has released the resources, and the node can continue to work without terminating. The unlock method can be implemented in various ways depending on the technology. In JBoss Infinispan a new type of listener is used: instead of listening for changes of the node topology, it listens for the removal of the cache entry with the reference to the node holding the lock. In Zookeeper the members simply watch for the removal of the ZNode belonging to the process holding the lock. In Netty, instead of closing the session after finishing the leader procedure, a message is sent back to the coordinator, which consequently deletes the given member from the watched channels and performs a new election without considering it.

5.6 JPaxos

The JPaxos framework was eventually excluded from the group of technologies used in the implementation part, for the following reasons. State machine replication is applicable in use cases where one server instance is to be replicated over a set of machines. However, in the case of these algorithms, the separate servers need to keep their own independent state in order to make individual decisions on the transaction result in the 2PC and 3PC protocols. This turns out to be impossible in JPaxos, because the replica and the service are bundled together and only then replicated. In leader election, a special flag designating the leader would again need to be applied to only one of the servers. The next reason is the fact that the technology is slowly becoming outdated, since it has not been updated for almost two and a half years. The documentation is currently largely unfinished and there are issues in the library still waiting to be resolved.

5.7 LOC and cyclomatic complexity analysis

This part compares the implementations in terms of two metrics, lines of code and cyclomatic complexity. The results are organised in a table.

Figure 5.1: Two phase commit

Figure 5.2: Three phase commit

Figure 5.3: Leader election

Figure 5.4: Distributed locking

6 Deployment on a cluster and test results

This part is devoted to specifying the configuration needed when deploying the given technologies on a cluster. The user experience of working with a technology for distributed processing is largely affected by how easy it is to deploy on the cluster. There are several ways of looking up the cluster members over the network; each time it is a combination of a host, in the form of an IP address, and a port the node is listening on. However, not all the technologies use the same concept. Some need to be configured only on the client side, and the server then responds back by inferring the client address automatically. Others use several servers, which need to know about each other, and when the configuration changes, each server must be reconfigured manually. The last method uses a multicast host, which is a very flexible approach and allows the node structure to be changed at any time.

6.1 Deployment details

6.1.1 Apache Zookeeper

The cluster formed by remote servers in Zookeeper is called a Zookeeper ensemble. Each machine in the ensemble must be configured to know about every other machine. The configuration of a Zookeeper server consists of the following steps. First it is necessary to install a Java JDK and set an appropriate Java heap size so that the Zookeeper server is able to run without unnecessary swapping. Then follows the installation of the ZooKeeper server package by downloading and unpacking it into a specified directory. The last thing required is creating a configuration file located in the server directory and setting at least these parameters: the path to the data directory for saving snapshots of the system and the file with the id of the server, the client port for the clients that want to connect to the server, and the list of the servers the ensemble consists of, in the form server.id=host:port:port. The id is a unique identifier of the server and there is a corresponding file in the data directory with the name of the id. Host is the address of the given
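A minimal configuration file of this kind might look as follows (a sketch with illustrative host names, ports and paths, using the standard zoo.cfg keys):

    # zoo.cfg - illustrative three-server ensemble
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.id=host:quorumPort:electionPort
    server.1=node1.example.org:2888:3888
    server.2=node2.example.org:2888:3888
    server.3=node3.example.org:2888:3888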


More information

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS 2013 Long Kai EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS BY LONG KAI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate

More information

Failures, Elections, and Raft

Failures, Elections, and Raft Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3

More information

FAULT TOLERANT LEADER ELECTION IN DISTRIBUTED SYSTEMS

FAULT TOLERANT LEADER ELECTION IN DISTRIBUTED SYSTEMS FAULT TOLERANT LEADER ELECTION IN DISTRIBUTED SYSTEMS Marius Rafailescu The Faculty of Automatic Control and Computers, POLITEHNICA University, Bucharest ABSTRACT There are many distributed systems which

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions Transactions Main issues: Concurrency control Recovery from failures 2 Distributed Transactions

More information

Dynamic Reconfiguration of Primary/Backup Clusters

Dynamic Reconfiguration of Primary/Backup Clusters Dynamic Reconfiguration of Primary/Backup Clusters (Apache ZooKeeper) Alex Shraer Yahoo! Research In collaboration with: Benjamin Reed Dahlia Malkhi Flavio Junqueira Yahoo! Research Microsoft Research

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

CSE 5306 Distributed Systems. Consistency and Replication

CSE 5306 Distributed Systems. Consistency and Replication CSE 5306 Distributed Systems Consistency and Replication 1 Reasons for Replication Data are replicated for the reliability of the system Servers are replicated for performance Scaling in numbers Scaling

More information

SimpleChubby: a simple distributed lock service

SimpleChubby: a simple distributed lock service SimpleChubby: a simple distributed lock service Jing Pu, Mingyu Gao, Hang Qu 1 Introduction We implement a distributed lock service called SimpleChubby similar to the original Google Chubby lock service[1].

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

This tutorial will give you a quick start with Consul and make you comfortable with its various components.

This tutorial will give you a quick start with Consul and make you comfortable with its various components. About the Tutorial Consul is an important service discovery tool in the world of Devops. This tutorial covers in-depth working knowledge of Consul, its setup and deployment. This tutorial aims to help

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju Chapter 4: Distributed Systems: Replication and Consistency Fall 2013 Jussi Kangasharju Chapter Outline n Replication n Consistency models n Distribution protocols n Consistency protocols 2 Data Replication

More information

ZooKeeper. Wait-free coordination for Internet-scale systems

ZooKeeper. Wait-free coordination for Internet-scale systems ZooKeeper Wait-free coordination for Internet-scale systems Patrick Hunt and Mahadev (Yahoo! Grid) Flavio Junqueira and Benjamin Reed (Yahoo! Research) Internet-scale Challenges Lots of servers, users,

More information

Distributed Systems (5DV147)

Distributed Systems (5DV147) Distributed Systems (5DV147) Replication and consistency Fall 2013 1 Replication 2 What is replication? Introduction Make different copies of data ensuring that all copies are identical Immutable data

More information

Apache Zookeeper. h,p://zookeeper.apache.org

Apache Zookeeper. h,p://zookeeper.apache.org Apache Zookeeper h,p://zookeeper.apache.org What is a Distributed System? A distributed system consists of mulaple computers that communicate through a computer network and interact with each other to

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Consistency and Replication Jia Rao http://ranger.uta.edu/~jrao/ 1 Reasons for Replication Data is replicated for the reliability of the system Servers are replicated for performance

More information

F5 BIG-IQ Centralized Management: Local Traffic & Network. Version 5.2

F5 BIG-IQ Centralized Management: Local Traffic & Network. Version 5.2 F5 BIG-IQ Centralized Management: Local Traffic & Network Version 5.2 Table of Contents Table of Contents BIG-IQ Local Traffic & Network: Overview... 5 What is Local Traffic & Network?... 5 Understanding

More information

Batches and Commands. Overview CHAPTER

Batches and Commands. Overview CHAPTER CHAPTER 4 This chapter provides an overview of batches and the commands contained in the batch. This chapter has the following sections: Overview, page 4-1 Batch Rules, page 4-2 Identifying a Batch, page

More information

A Reliable Broadcast System

A Reliable Broadcast System A Reliable Broadcast System Yuchen Dai, Xiayi Huang, Diansan Zhou Department of Computer Sciences and Engineering Santa Clara University December 10 2013 Table of Contents 2 Introduction......3 2.1 Objective...3

More information

Modern Database Concepts

Modern Database Concepts Modern Database Concepts Basic Principles Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz NoSQL Overview Main objective: to implement a distributed state Different objects stored on different

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos Consensus I Recall our 2PC commit problem FLP Impossibility, Paxos Client C 1 C à TC: go! COS 418: Distributed Systems Lecture 7 Michael Freedman Bank A B 2 TC à A, B: prepare! 3 A, B à P: yes or no 4

More information

BookKeeper overview. Table of contents

BookKeeper overview. Table of contents by Table of contents 1...2 1.1 BookKeeper introduction...2 1.2 In slightly more detail...2 1.3 Bookkeeper elements and concepts... 3 1.4 Bookkeeper initial design... 3 1.5 Bookkeeper metadata management...

More information

Distributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang

Distributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang A lightweight, pluggable, and scalable ingestion service for real-time data ABSTRACT This paper provides the motivation, implementation details, and evaluation of a lightweight distributed extract-transform-load

More information

Consistency in Distributed Systems

Consistency in Distributed Systems Consistency in Distributed Systems Recall the fundamental DS properties DS may be large in scale and widely distributed 1. concurrent execution of components 2. independent failure modes 3. transmission

More information

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Topics in Reliable Distributed Systems

Topics in Reliable Distributed Systems Topics in Reliable Distributed Systems 049017 1 T R A N S A C T I O N S Y S T E M S What is A Database? Organized collection of data typically persistent organization models: relational, object-based,

More information

Bull. HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02

Bull. HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02 Bull HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02 Bull HACMP 4.4 Programming Locking Applications AIX Software August 2000 BULL CEDOC 357 AVENUE PATTON B.P.20845 49008

More information

Distributed Consensus Protocols

Distributed Consensus Protocols Distributed Consensus Protocols ABSTRACT In this paper, I compare Paxos, the most popular and influential of distributed consensus protocols, and Raft, a fairly new protocol that is considered to be a

More information

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 14: Data Replication Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Database Replication What is database replication The advantages of

More information

Time and Space. Indirect communication. Time and space uncoupling. indirect communication

Time and Space. Indirect communication. Time and space uncoupling. indirect communication Time and Space Indirect communication Johan Montelius In direct communication sender and receivers exist in the same time and know of each other. KTH In indirect communication we relax these requirements.

More information

Broker Clusters. Cluster Models

Broker Clusters. Cluster Models 4 CHAPTER 4 Broker Clusters Cluster Models Message Queue supports the use of broker clusters: groups of brokers working together to provide message delivery services to clients. Clusters enable a Message

More information

RFC 003 Event Service October Computer Science Department October 2001 Request for Comments: 0003 Obsoletes: none.

RFC 003 Event Service October Computer Science Department October 2001 Request for Comments: 0003 Obsoletes: none. Ubiquitous Computing Bhaskar Borthakur University of Illinois at Urbana-Champaign Software Research Group Computer Science Department October 2001 Request for Comments: 0003 Obsoletes: none The Event Service

More information

Coordination and Agreement

Coordination and Agreement Coordination and Agreement Nicola Dragoni Embedded Systems Engineering DTU Informatics 1. Introduction 2. Distributed Mutual Exclusion 3. Elections 4. Multicast Communication 5. Consensus and related problems

More information

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication CSE 444: Database Internals Section 9: 2-Phase Commit and Replication 1 Today 2-Phase Commit Replication 2 Two-Phase Commit Protocol (2PC) One coordinator and many subordinates Phase 1: Prepare Phase 2:

More information

CS October 2017

CS October 2017 Atomic Transactions Transaction An operation composed of a number of discrete steps. Distributed Systems 11. Distributed Commit Protocols All the steps must be completed for the transaction to be committed.

More information

Master s Thesis. A Construction Method of an Overlay Network for Scalable P2P Video Conferencing Systems

Master s Thesis. A Construction Method of an Overlay Network for Scalable P2P Video Conferencing Systems Master s Thesis Title A Construction Method of an Overlay Network for Scalable P2P Video Conferencing Systems Supervisor Professor Masayuki Murata Author Hideto Horiuchi February 14th, 2007 Department

More information

Assignment 12: Commit Protocols and Replication Solution

Assignment 12: Commit Protocols and Replication Solution Data Modelling and Databases Exercise dates: May 24 / May 25, 2018 Ce Zhang, Gustavo Alonso Last update: June 04, 2018 Spring Semester 2018 Head TA: Ingo Müller Assignment 12: Commit Protocols and Replication

More information

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Building Agile and Resilient Schema Transformations using Apache Kafka and ESB's Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Ricardo Ferreira

More information

Synchronization. Chapter 5

Synchronization. Chapter 5 Synchronization Chapter 5 Clock Synchronization In a centralized system time is unambiguous. (each computer has its own clock) In a distributed system achieving agreement on time is not trivial. (it is

More information

CS505: Distributed Systems

CS505: Distributed Systems Cristina Nita-Rotaru CS505: Distributed Systems Protocols. Slides prepared based on material by Prof. Ken Birman at Cornell University, available at http://www.cs.cornell.edu/ken/book/ Required reading

More information

Indirect Communication

Indirect Communication Indirect Communication Vladimir Vlassov and Johan Montelius KTH ROYAL INSTITUTE OF TECHNOLOGY Time and Space In direct communication sender and receivers exist in the same time and know of each other.

More information

416 practice questions (PQs)

416 practice questions (PQs) 416 practice questions (PQs) 1. Goal: give you some material to study for the final exam and to help you to more actively engage with the material we cover in class. 2. Format: questions that are in scope

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Copyright 2003 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4. Other Approaches

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

Implementation and Performance of a SDN Cluster- Controller Based on the OpenDayLight Framework

Implementation and Performance of a SDN Cluster- Controller Based on the OpenDayLight Framework POLITECNICO DI MILANO Dipartimento di Elettronica, Informazione e Bioingegneria Master of Science in Telecommunication Engineering Implementation and Performance of a SDN Cluster- Controller Based on the

More information

Trinity File System (TFS) Specification V0.8

Trinity File System (TFS) Specification V0.8 Trinity File System (TFS) Specification V0.8 Jiaran Zhang (v-jiarzh@microsoft.com), Bin Shao (binshao@microsoft.com) 1. Introduction Trinity File System (TFS) is a distributed file system designed to run

More information

<Insert Picture Here> QCon: London 2009 Data Grid Design Patterns

<Insert Picture Here> QCon: London 2009 Data Grid Design Patterns QCon: London 2009 Data Grid Design Patterns Brian Oliver Global Solutions Architect brian.oliver@oracle.com Oracle Coherence Oracle Fusion Middleware Product Management Agenda Traditional

More information

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 6-A-1 CS535 BIG DATA FAQs PA1: Use only one word query Deadends {{Dead end}} Hub value will be?? PART 1. BATCH COMPUTING MODEL FOR BIG DATA ANALYTICS 4. GOOGLE FILE SYSTEM

More information

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures GFS Overview Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures Interface: non-posix New op: record appends (atomicity matters,

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

Indirect Communication

Indirect Communication Indirect Communication Today l Space and time (un)coupling l Group communication, pub/sub, message queues and shared memory Next time l Distributed file systems xkdc Indirect communication " Indirect communication

More information

Consistency and Replication 1/65

Consistency and Replication 1/65 Consistency and Replication 1/65 Replicas and Consistency??? Tatiana Maslany in the show Orphan Black: The story of a group of clones that discover each other and the secret organization Dyad, which was

More information

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems Fall 2017 Exam 3 Review Paul Krzyzanowski Rutgers University Fall 2017 December 11, 2017 CS 417 2017 Paul Krzyzanowski 1 Question 1 The core task of the user s map function within a

More information

Basic vs. Reliable Multicast

Basic vs. Reliable Multicast Basic vs. Reliable Multicast Basic multicast does not consider process crashes. Reliable multicast does. So far, we considered the basic versions of ordered multicasts. What about the reliable versions?

More information