}w!"#$%&'()+,-./012345<ya

Size: px
Start display at page:

Download "}w!"#$%&'()+,-./012345<ya"

Transcription

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Comparison of Java Frameworks for Distributed Application Development

BACHELOR THESIS

Šimon Hochla

Brno, Spring 2015

Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during its elaboration are properly cited and listed in complete reference to the due source.

Šimon Hochla

Advisor: RNDr. Filip Nguyen

Acknowledgement

I would like to express my gratitude to my supervisor RNDr. Filip Nguyen for his support, suggestions and motivation during the elaboration of this thesis.

Abstract

The aim of this work is to compare and assess some of the recent frameworks for the development of distributed applications in Java. The chosen frameworks are Apache Zookeeper, JBoss Infinispan, Akka, Netty and JPaxos. The work compares the technologies from both the practical and the theoretical perspective. It discusses the consistency, availability and resilience guarantees they provide, their programming interfaces, deployment process and intended usage. In the practical part the individual technologies are demonstrated by implementing several agreement and coordination algorithms for a distributed environment.

Keywords

Apache Zookeeper, Netty, JBoss Infinispan, JPaxos, Akka, Java, CAP theorem, distributed systems, two phase commit, three phase commit, leader election, distributed locking

Contents

1 Introduction
2 Frameworks overview
  2.1 Applicability, intended usage and software requirements
    2.1.1 Apache Zookeeper
    2.1.2 Netty
    2.1.3 JBoss Infinispan
    2.1.4 Akka
    2.1.5 JPaxos
  2.2 Algorithms the technologies are based on
    2.2.1 Apache Zookeeper
    2.2.2 Netty
    2.2.3 JBoss Infinispan
    2.2.4 Akka
    2.2.5 JPaxos
3 Resilience, consistency and availability
  3.1 CAP theorem
  3.2 Apache Zookeeper
  3.3 Netty
  3.4 JBoss Infinispan
  3.5 Akka
  3.6 JPaxos
4 Coordination and agreement in distributed systems
  4.1 Two phase commit
  4.2 Three phase commit
  4.3 Leader election
  4.4 Distributed locking
5 Implementation of the algorithms
  5.1 File locking implementation
  5.2 Two phase commit
    5.2.1 Apache Zookeeper
    5.2.2 JBoss Infinispan
    5.2.3 Akka
    5.2.4 Netty
  5.3 Three phase commit
  5.4 Leader election
    5.4.1 Apache Zookeeper
    5.4.2 Netty
    5.4.3 JBoss Infinispan
    5.4.4 Akka
  5.5 Distributed locking
  5.6 JPaxos
  5.7 LOC and cyclomatic complexity analysis
6 Deployment on a cluster and test results
  6.1 Deployment details
    6.1.1 Apache Zookeeper
    6.1.2 JBoss Infinispan
    6.1.3 Netty
    6.1.4 Akka
  6.2 Performance results and their brief reasoning
7 Subjective comparison of the technologies
  7.1 Apache Zookeeper
  7.2 JBoss Infinispan
  7.3 Netty
  7.4 Akka
8 Conclusion

1 Introduction

The area of distributed systems is known for being associated with many pitfalls. Besides the deep theoretical knowledge required from programmers, the development of distributed applications is error prone and can become very time consuming. This situation was an incentive for the creation of tools that help programmers focus on the concrete problems being solved instead of dealing with the common tasks of distributed computing. Such tools facilitate development by providing an infrastructure for the runtime of a distributed system and a programming interface that is often very specific to the given technology. Recent technologies of this kind with bindings to Java include JBoss Infinispan, Apache Zookeeper, JPaxos and Akka. An alternative with a broader scope of applicability is the Netty framework. It enables the development of network protocols and represents the technologies without embedded support for distributed processing.

The aim of this work is to compare and assess the above mentioned frameworks from both the theoretical aspect and the practical applicability to implementing distributed algorithms. The theoretical part analyses them from the point of view of consistency, availability and partition tolerance, and discusses their intended usage and the algorithms the technologies are built on. The practical part demonstrates the specific approaches of the given technologies by implementing the following distributed algorithms: the two phase commit and three phase commit protocols applied to a distributed agreement on writing to a file, leader election, and distributed locking for providing unique access to a file. The algorithms will be deployed and tested on a cluster consisting of several physical machines. The work then compares the performance results measured on the cluster and analyses the implementations using the lines of code and cyclomatic complexity metrics.

The second chapter gives an overview of the technologies with descriptions of their usage, programming interface and the algorithms they rely on. In the third chapter they are analysed from the point of view of the provided resilience, consistency and availability guarantees.

The fourth and fifth chapters are devoted to the specification of the selected algorithms and their implementations using the particular technologies. The sixth chapter contains the deployment details and the results of the testing. The last chapter summarizes the subjective opinions on the technologies acquired during the implementation process.

2 Frameworks overview

2.1 Applicability, intended usage and software requirements

2.1.1 Apache Zookeeper

Apache Zookeeper is an open source project of the Apache Software Foundation. Its purpose is to facilitate the development of distributed algorithms. A simple node architecture resembling a file system enables solving tasks that require coordination. [JR13, 3] The typical use cases of Zookeeper are naming services (converting a name into a physical address on a network), configuration management (joining servers bootstrap their configuration without any action on the side of the centralized source), synchronization (implementation of primitives like locks, barriers and queues), leader election, message queuing (applications communicating over a network can send messages to queues and read from them) and notification systems. Zookeeper ships with a C and a Java API, the core library is written in Java, and the target production platforms are GNU/Linux, Sun Solaris and FreeBSD. Win32/Win64 and MacOSX can be used only for development.

2.1.2 Netty

Netty is a framework providing tools for the rapid development of protocols for servers and clients using an asynchronous, event driven architecture. The technology was originally developed by JBoss, but is now supplied as an open source product of the Netty Project Community. Netty is applicable wherever new protocols need to be designed. General purpose protocols are often used in places where a more narrowly focused version of the protocol would be preferable; another case is when a legacy protocol cannot be used in a new environment. Netty is widely adopted by many organisations, including almost all of the world's biggest providers of software services.

The framework provides bindings to Java and can be used on both the Windows and the Linux platform.

2.1.3 JBoss Infinispan

Infinispan is a data grid platform exposing an interface for storing data in the form of key-value pairs. It guarantees high availability of the service and provides great scalability of the structure. The technology is developed by Red Hat, written in Java and provides bindings to Java. JBoss Infinispan is most commonly applied as a distributed cache in front of NoSQL databases, but it can also be used as a NoSQL store itself or simply to add clustering capabilities to other frameworks.

2.1.4 Akka

Akka is a toolkit and runtime for creating distributed, message driven applications running on the JVM. It is developed by Typesafe Inc. and distributed as an open source project. Above other programming models it provides actor-based concurrency. Akka is a robust technology offering a wide range of services for specific scenarios. It is used in large systems built around transaction processing, mostly in the gaming, finance and trading industries, in tasks requiring parallelism and in many other places. Akka is a cross-platform framework and provides language bindings for Java and Scala; the core is written in Scala. The first release with support for Java comes from

2.1.5 JPaxos

JPaxos is a project developed in a university environment providing facilities for efficient state machine replication, a method for implementing a fault-tolerant service by replicating the service over a set of machines called replicas. It enables the development of highly crash tolerant applications that tolerate message loss and communication delays.

JPaxos can be used as an experimental platform for research into service replication or in commercial products respecting the LGPL 3.0 licence. The design is based on solid theoretical foundations and represents the outcome of the currently conducted communication research. It is bundled as a Java library and the latest release comes from January of

2.2 Algorithms the technologies are based on

2.2.1 Apache Zookeeper

There is a single way of communication between the participating servers, and that is through a shared hierarchical namespace similar to a file system. This hierarchy consists of nodes called ZNodes with a parent node common to all ZNodes. The nodes are identified by a path constructed from their respective parent nodes separated by a slash (/), with the root node at the beginning. Each ZNode except for the root has exactly one parent node and optionally its own set of children. Furthermore, there is a small data structure associated with every ZNode, which can be used for storing any data up to 1 MB, but preferably much smaller. There are two types of ZNodes: persistent and ephemeral. Ephemeral nodes cease to exist after the client session is terminated; persistent nodes remain in the system until they are deleted. When a node is created it must be assigned a unique path. This can be achieved by the functionality called sequence nodes, which appends a monotonically increasing number to the end of the path if several nodes with the same name are created under a common parent. [Reeb]

The framework provides an event system called watches, which can be utilized together with ephemeral and sequential nodes for notification about a new client connection or disconnection. A watch is assigned to a specific node and can be triggered by its creation, deletion, modification or a change in its children.
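To make the ZNode and watch concepts concrete, a minimal sketch using the standard org.apache.zookeeper.ZooKeeper client might look as follows (the connection string and paths are illustrative and not taken from the thesis code):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper server; the watcher lambda receives session events.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

            // Create a persistent parent node if it does not exist yet.
            if (zk.exists("/demo", false) == null) {
                zk.create("/demo", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // An ephemeral sequential child: it disappears with the session and
            // gets a monotonically increasing suffix appended to its name.
            String member = zk.create("/demo/member-", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // A one-shot watch on the children of /demo, triggered by the next change.
            zk.getChildren("/demo", event -> System.out.println("children changed: " + event));

            System.out.println("created " + member);
            zk.close();
        }
    }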

ZooKeeper is a service replicated over a set of servers running in a distributed environment. These servers keep an in-memory copy of the data tree along with transaction logs and snapshots stored in persistent memory. Clients always connect to only a single ZooKeeper server. The client establishes a TCP connection with the given server and sends heartbeats informing about its availability. When the server fails, the client reconnects to a different one. The Zookeeper service is available as long as a strict majority of the servers is running. There is one master node, which is dynamically elected and through which all write operations are performed. The master guarantees that after a successful write operation a majority of the nodes will hold the recent value. Write operations in Zookeeper are expensive compared to read operations, because they require all the servers to perform the update, while a read needs only a single server. Zookeeper is therefore meant to be used in an environment with a very large number of nodes, mostly read operations and no large quantities of data being sent.

2.2.2 Netty

The main idea behind Netty is its asynchronous architecture. It uses operations that are strictly non-blocking but share the same thread at the same time. This means that every method returns immediately rather than waiting for the result. What is returned instead is a ChannelFuture, which can be listened on to find out whether the operation succeeded, failed, or was cancelled. The communication unit sent between servers is the ChannelBuffer, an interface providing an abstract view of one or more primitive byte arrays. Netty includes standard conversions for many types, including String, HTTP and others, but users who want their own types must provide both an encoder and a decoder. The Channel component is used as the abstract representation of a socket capable of operations such as read, write, connect and bind. Above all, it provides the information whether it is in the connected or disconnected state and keeps a ChannelPipeline holding the channel handlers assigned to the channel. The main business logic is usually implemented in a class extending ChannelHandler. It contains methods which the user can override as needed. One of them is channelActive(ChannelHandlerContext ctx), triggered when a new connection with another channel is established.

ChannelHandlerContext holds a reference to that channel. The next one is channelRead(ChannelHandlerContext ctx, Object msg), which is triggered when some data is sent to the channel. The message can be answered by calling the write method on the channel it was received from. The ChannelPipeline passes the raw data to the first handler and returns it in its final form from the last one; this is also where the encoders and decoders are engaged. A Bootstrap instance starts and stops Netty applications. ServerBootstrap uses the method bind(), while Bootstrap uses the method connect().

2.2.3 JBoss Infinispan

JBoss Infinispan is based on cache objects providing a map-like interface extending java.util.Map. These caches can be used for storing key-value pairs of any Object types. The caches can be persisted across the cluster in four ways. Local cache is the mode where the cache is stored only locally in the given node. The next one is invalidation mode, where all entries are saved in a cache store such as a database; when a node wants to read a value, it loads it from the cache store holding the invalidated values. The third one is replicated mode, where the caches are uniformly replicated over all of the nodes in the cluster. Finally, there is distributed mode, where only a subset of all nodes stores the given cache. This can be useful when a certain degree of fault tolerance is wanted while retaining scalability.

To get access to the caches, an instance of CacheManager must be created and passed an XML file providing the configuration of the given caches. On this object it is possible to call the get method with the name of the requested cache. The cache can then be used as an ordinary map object, or a listener can be set on it watching the state of the cache. This listener can have several methods triggered depending on the type of change in the cache, above all the creation of a new entry, the removal of an entry or its modification. Next, the user has the option to set the listener not on the cache but on the CacheManager. There it is possible to watch changes of the node structure, triggering an event when either a new node appears in the system or one of the nodes disconnects. This applies to nodes, possibly located on different machines in the cluster, that use one of the caches contained in the configuration file.
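A minimal sketch of this API (the configuration file name, cache name and listener are illustrative, not the thesis code) could look like this:

    import org.infinispan.Cache;
    import org.infinispan.manager.DefaultCacheManager;
    import org.infinispan.notifications.Listener;
    import org.infinispan.notifications.cachelistener.annotation.CacheEntryCreated;
    import org.infinispan.notifications.cachelistener.event.CacheEntryCreatedEvent;

    public class CacheSketch {

        // Listener invoked by Infinispan whenever a new entry appears in the cache.
        @Listener
        public static class CreationListener {
            @CacheEntryCreated
            public void entryCreated(CacheEntryCreatedEvent<String, String> event) {
                System.out.println("created: " + event.getKey());
            }
        }

        public static void main(String[] args) throws Exception {
            // The XML file is assumed to define a cache named "demoCache".
            DefaultCacheManager manager = new DefaultCacheManager("infinispan.xml");
            Cache<String, String> cache = manager.getCache("demoCache");

            cache.addListener(new CreationListener());
            cache.put("vote-node1", "commit");   // used like an ordinary map
            System.out.println(cache.get("vote-node1"));

            manager.stop();
        }
    }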

2.2.4 Akka

The Akka architecture is based on the Actor model defined in 1973 by Carl Hewitt and later popularized by the Erlang language. Actors can be perceived as a higher level abstraction over objects. They are similar to the abstraction of a person in the sense that every actor has its own place in a hierarchy and is responsible for the actors situated below it. Actors encapsulate their state and behaviour and communicate with their environment only through a single request-response method.

Actors are implemented by extending the UntypedActor class, which defines one abstract method, onReceive(Object message). This method determines the behaviour of the actor. The received message is of the Object type, which enables sending data of any kind; the original type can be recovered with an instanceof check on the received object. Further methods accessible from the context of onReceive() are getSender(), returning a reference to the sender of type ActorRef, and getSelf(), returning the reference to the given actor. The actor reference provides a method tell(Object message, ActorRef sender) for sending a message to the actor it is called on, with the reference to the sender in the parameter. Another important part of the Akka functionality is hidden in the method getContext(), also accessible only from the onReceive() context. It exposes facilities for creating new child actors and for getting the system the actor belongs to, its parent supervisor and its supervised children. The other methods that can be overridden on the UntypedActor are preStart(), preRestart(), postRestart() and postStop(), for managing the actor in the different stages of its lifecycle. Lifecycle monitoring is the way to watch an actor for its termination; there is a method watch(ActorRef targetActorRef) accessible from the ActorContext referencing the watched actor.
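A minimal sketch of this actor API (actor and message names are illustrative, written against the classic UntypedActor-based Java API described above):

    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.actor.UntypedActor;

    public class EchoSketch {

        // An actor that replies to every String message it receives.
        public static class EchoActor extends UntypedActor {
            @Override
            public void onReceive(Object message) {
                if (message instanceof String) {
                    getSender().tell("echo: " + message, getSelf());
                } else {
                    unhandled(message);
                }
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("demo-system");
            ActorRef echo = system.actorOf(Props.create(EchoActor.class), "echo");

            // Fire-and-forget send; ActorRef.noSender() means no reply address.
            echo.tell("hello", ActorRef.noSender());
        }
    }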

The last two functionalities, which cannot be omitted, are routing and remoting. Messages can be sent to a router, which is an actor that directs the messages to destination actors. There are many ways to route the messages; Akka contains routers providing support for load balancing, a round robin mechanism, scatter-gather or a simple broadcast. Remoting is the way to either look up or create actors in a different ActorSystem, possibly running on a separate machine. This can be achieved by changing the LocalActorRefProvider to the RemoteActorRefProvider and assigning the specified paths to be created on a remote host.

2.2.5 JPaxos

JPaxos implements the Paxos distributed algorithm. It adheres to the basic principles of the protocol and introduces several optimizations, utilizing techniques like batching for the performance improvements required by modern systems. The Paxos algorithm is used for reaching agreement on a value among a group of nodes. It guarantees that after one node sees a value that a majority of the nodes sees as well, the majority will never decide on a different value. Any change proposed by a node must therefore first be agreed on by a majority of the nodes. It requires a strict ordering of messages from the nodes; all changes are assured to be applied in the same order on all the nodes in the system.

JPaxos provides three types of instances the end user works with. The Service interface is where the user-provided functionality is defined; it specifies methods for the interaction with the JPaxos system. Next is the Replica class, which when deployed is bound to the implementation of the Service. The last one is the Client class, which is used for sending the client's requests to the service. When a Replica instance is created, it is passed a Service instance, a configuration class and a unique id for identifying the replica among the other replicas. Usually several replicas doing the same job are started in the system, which only adds to its robustness. JPaxos provides several types of Service interfaces with different degrees of abstraction. The most common ones are the SimplifiedService, operating with byte arrays, and the SerializableService, which performs serialization automatically.

Above all, they define a method for the execution of the request-response communication that accepts the client data and returns a response back to the client. The next two abstract methods the user needs to implement are for creating a snapshot and updating to a certain version of a snapshot. A snapshot consists of the data the service persists between the individual executions. The Client requires upon creation a configuration file with the specification of the system containing the host addresses of the replicas. The Client provides two methods, for connecting to the system and for executing requests. It connects to the replicas dynamically: when the one it is connected to crashes, another one is used instead.

3 Resilience, consistency and availability

3.1 CAP theorem

The CAP theorem states that it is impossible for a distributed computer system to provide all three of the following guarantees at the same time: Consistency, Availability and Partition tolerance. [SG]

Consistency: there must be a total order on all operations performed in the system; every node sees exactly the same state of the requested resource.

Availability: for a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response. [Lam86] This means that any finite algorithm on a distributed system must eventually terminate. There is no restriction on how long it will take, only that it happens in finite time.

Partition tolerance: the system continues to work after losing an arbitrary number of messages. Any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost.

There has been a lot of debate about which combination of these three guarantees is the most preferable. The conclusion is that in systems built upon real world networks it is almost impossible to find a case where no messages will ever be lost and no nodes will fail. Eventually, one ends up deciding between the remaining two, Availability and Consistency. Either way is a certain tradeoff with its consequences. When consistency is chosen over availability, it is necessary to cope with issues concerning the perceived liveness of the system from the point of view of the client. Any writes arriving while the system is unavailable must be buffered so that the operation can proceed after the temporary partition is resolved; this can itself lead to inconsistency when the buffer fails and the writes are lost. The other strategy is to inform the client with an error message and ask them to wait until the system is available again.

When availability is chosen over consistency, it can happen that multiple readers get different results for the same property. This is because every reader receives only the most recent value that the node it reads from can provide, not the last value written in the whole system. This requires the application programmers developing their services on top of the system to cope with these inconsistencies and try to provide eventual consistency (a consistency model guaranteeing that, if no new updates are made to a given data item, eventually all accesses to that item return the last updated value). It is up to the programmers to update the value to the most current state after a divergence has been detected, which requires history tracking and update merging.

3.2 Apache Zookeeper

Apache Zookeeper provides the following consistency guarantees:

atomicity: updates either succeed or fail
sequential consistency: updates are applied in order
reliability: updates persist once applied
single system image: a client sees the same view of the service regardless of the ZK server it connects to
timeliness: the client's view of the system is guaranteed to be up-to-date within a certain time bound (eventual consistency) [Reea]

Zookeeper tolerates partitions with up to n failed server nodes in an ensemble originally consisting of at least 2n+1 servers; the remaining n+1 servers are called a quorum. Two situations can happen when a partition occurs: the current leader either falls into the part holding the quorum and the system continues normally, or it falls into the non-quorum part and a new leader must be elected. After the non-quorum part rejoins, it updates the values created during the time of its disconnection. [fai]

3.3 Netty

Netty represents a technology which does not provide any support for resuming after a server or client disconnection; it is up to the developer to implement this behaviour. When a channel is closed, meaning that a client that was previously connected to the server has terminated, the method channelInactive() is triggered. As Netty is purely asynchronous, all operations are performed without waiting for the return value. What can be utilized to get information about success or failure is the ChannelFuture object returned by every operation. It is a handle to the event related to the operation that returned it. A listener can be set on the ChannelFuture to get informed about the result of the operation. The fact that an operation completed does not mean that it succeeded; in fact there are four possible results: success, failure, timeout and cancellation.
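Such a listener can be attached as in the following fragment (a sketch; the class name, write call and message are illustrative):

    import io.netty.channel.Channel;
    import io.netty.channel.ChannelFuture;
    import io.netty.channel.ChannelFutureListener;

    public class FutureSketch {

        // Writes a message and reports the outcome once the asynchronous operation finishes.
        static void writeAndReport(Channel channel, Object message) {
            ChannelFuture future = channel.writeAndFlush(message);
            future.addListener(new ChannelFutureListener() {
                @Override
                public void operationComplete(ChannelFuture f) {
                    if (f.isSuccess()) {
                        System.out.println("write succeeded");
                    } else if (f.isCancelled()) {
                        System.out.println("write cancelled");
                    } else {
                        System.out.println("write failed: " + f.cause());
                    }
                }
            });
        }
    }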

3.4 JBoss Infinispan

Infinispan is described as leaning towards providing Consistency and Availability while sacrificing Partition tolerance. However, Infinispan offers the option to enable or disable partition handling. With partition handling disabled, a partitioned node is assumed to have crashed and must be restarted to rejoin the cluster. It starts again with a different JGroups address and without holding any data. When the cluster is separated into two partitions with unsynchronized updates, then after joining together the partition with the larger number of nodes overwrites the data held by the smaller partition. In the other case, a node merges back into the cluster after a disconnection without a restart. During a merge there are five possible scenarios which can occur, and there is a complex theory behind the process of merging in Infinispan. It always tries to resolve the inconsistencies and make the best of the state at hand. The process also depends on the mode the caches are set to, and only eventual consistency can be assumed. What always comes first in Infinispan is the availability of the system.

3.5 Akka

In Akka the availability and partition tolerance guarantees are bundled together. Akka uses vector clocks for the partial ordering of events; on each update to the cluster state the vector clock is updated too. For broadcasting the message with the current state of the system the gossip protocol is used, which sends the information in a randomized fashion with preference for the nodes that have not seen the latest version. Gossip convergence is the process built on this protocol that finishes when all nodes see the same view of the system. Throughout the gossip convergence the system continues to run and only the membership management is affected; any new nodes wanting to join the cluster must wait until the gossip convergence is done. For partition detection Akka uses the Phi Accrual Failure Detector. It determines from the history of a node's behaviour whether the node is actually down or is only temporarily disconnected. The threshold for deciding between the two states is configurable. This can be used to provide eventual consistency. [Akk]

3.6 JPaxos

JPaxos supports two modes of operation. The basic mode is able to tolerate n replica breakdowns when the system is deployed on at least n+1 machines. The extended mode uses non-volatile memory for persisting snapshots, so any number of replicas can become corrupted and the system is still able to recover. The period in which the stored snapshot is refreshed is determined by the selected crash model: it can be persisted on the start of the replica, periodically, or after every state change, the last option being suitable only for highly crash prone systems because of its high performance cost. The Paxos algorithm guarantees strict ordering of the updates, so the state of a replica is kept consistent with the other replicas.

4 Coordination and agreement in distributed systems

This chapter provides the theoretical background for the algorithms implemented in this work. The choice of the algorithms is discussed first, followed by the individual definitions. The first two algorithms are the two phase commit and three phase commit protocols, which are used for distributed agreement on a globally applied change in a distributed system. The next algorithm is leader election, for deciding on a leader in a group of members. The last one is distributed locking, for synchronizing access to shared resources. These algorithms were chosen because of their simplicity and their very long tradition in the area of distributed systems. More sophisticated variants of them exist, but the aim of this work is to demonstrate how the given technologies can provide these functions.

4.1 Two phase commit

The two phase commit (2PC) algorithm is a way to decide on approving or cancelling a new transaction in a group of nodes. There is one coordinator node, which manages the transaction, and several participants that can either agree or disagree with the proposed transaction. The algorithm consists of two phases, a voting phase and a commit phase. In the voting phase the coordinator sends a request to each participant. The participants decide to commit or abort the transaction, log their decision, lock the shared resources in the case of commit, and send their votes back to the coordinator. In the commit phase the coordinator makes the final decision based on the collected votes and sends it to all of the participants. The decision is commit when all participants have voted commit and abort in any other case. Each participant then either proceeds with the transaction, in the case of commit, or rolls it back, when the decision is abort, and sends an acknowledgement back to the coordinator. When the coordinator has collected acknowledgements from all of the participants, the transaction is done. [PAB09]
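As a framework-independent illustration of the protocol just described, the following sketch (the interface and names are hypothetical, not taken from the thesis code) outlines the coordinator side; crashes and timeouts are deliberately ignored:

    import java.util.List;

    public class TwoPhaseCommitCoordinator {

        // Hypothetical view of a participant as seen by the coordinator;
        // in the real implementations these calls travel over the network.
        public interface Participant {
            boolean vote();      // voting phase: true = commit, false = abort
            void commit();       // commit phase: apply the transaction
            void rollback();     // commit phase: undo the transaction
        }

        // Runs one transaction and returns true if it was committed everywhere.
        // The acknowledgement is modelled by the return of commit()/rollback().
        public boolean runTransaction(List<Participant> participants) {
            // Phase 1: collect the votes.
            boolean allCommit = true;
            for (Participant p : participants) {
                if (!p.vote()) {
                    allCommit = false;
                }
            }
            // Phase 2: broadcast the decision.
            for (Participant p : participants) {
                if (allCommit) {
                    p.commit();
                } else {
                    p.rollback();
                }
            }
            return allCommit;
        }
    }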

4.2 Three phase commit

Three phase commit (3PC) is a refinement of the 2PC algorithm that guarantees strong termination when timeouts are used. The system cannot reach a blocking state in which all nodes are waiting for one crashed node to recover. This can happen in 2PC when both the coordinator and one of the participants fail at the same time. The guarantee is provided by adding a third phase between the voting and commit phases, called the precommit phase. It follows after all the participants have agreed to the transaction and before the coordinator sends the decision to commit. [IK]

4.3 Leader election

This is the most platform dependent algorithm, meaning it can be achieved in various ways depending on the technology it is implemented on. There are three rules which the implementation must obey: the algorithm must terminate in finite time, only one node is selected as the leader, and every other node is informed about the new leader after it has been elected. [IG00]

4.4 Distributed locking

Distributed locking is needed whenever there is a shared resource, replicated over a set of separate nodes, which may be accessed by only one of the members at a time. There are basically two methods which need to be implemented: the lock method, which waits until the resource is available and then provides unique access, and the unlock method for releasing the locked resource.
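A minimal Java interface capturing these two operations (purely illustrative, not part of the thesis code) could look like this:

    public interface DistributedLock {

        // Blocks until the shared resource is available and this member holds the lock.
        void lock() throws InterruptedException;

        // Releases the lock so that another member may acquire it.
        void unlock();
    }

The implementations in chapter five effectively provide these two operations on top of the primitives of the individual technologies.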

5 Implementation of the algorithms

This chapter describes the implementation strategies specific to the chosen technologies when applied to the algorithms from chapter four. A description of the file locking mechanism common to all implementations comes first, followed by the individual implementations categorised by the selected algorithms.

5.1 File locking implementation

To demonstrate that synchronized access to shared resources can be achieved using the selected algorithms, the implementation provides a mechanism for distributed file locking. The following assumptions relate to the locking mechanism: no other process should be able to delete or modify the file after a lock has been assigned in the given process, and only this process should have the privilege to perform write operations during that time. The lock is released when the process either terminates or calls the unlock method.

The FileLock API from java.nio.channels provides exactly the above mentioned functionality and is platform independent, so it is used in the file locking implementation located in the class LockFileDemo. There are three methods defined in LockFileDemo: lockFile(), releaseLock() and writeToFile(String data). The first one tries to acquire an exclusive lock for the file specified as a global variable. If the file is already locked by a different JVM, the process exits with an error, because this is an unexpected usage; otherwise it assigns the newly acquired lock to a global variable holding the reference to the lock. WriteToFile() simply appends the given string to the instance of java.io.RandomAccessFile opened by lockFile(). The lock can then be released by the method releaseLock(), which informs the user when the lock has not been defined or is invalid. Furthermore, releaseLock() closes the file reference created by lockFile(), so it is no longer possible to use the method writeToFile().
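The core of such a mechanism, using the standard java.nio.channels.FileLock API, can be sketched as follows (a simplified illustration, not the thesis's LockFileDemo class itself; the file name is illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class FileLockSketch {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile("shared.txt", "rw")) {
                FileChannel channel = file.getChannel();

                // tryLock() returns null if another process already holds the lock
                // (and throws OverlappingFileLockException within the same JVM).
                FileLock lock = channel.tryLock();
                if (lock == null) {
                    System.err.println("file is already locked by another JVM");
                    return;
                }
                try {
                    // Exclusive access: append a line to the end of the file.
                    channel.position(channel.size());
                    channel.write(ByteBuffer.wrap("hello\n".getBytes()));
                } finally {
                    lock.release();   // corresponds to releaseLock()
                }
            }
        }
    }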

5.2 Two phase commit

5.2.1 Apache Zookeeper

Using the Zookeeper node system, the coordinator creates a transaction node under which the participating sites create their respective child nodes with the EPHEMERAL and SEQUENTIAL flags when they join the system. Each site then sets a watch on changes of the coordinator's node data content. The coordinator waits until the expected number of sites has joined and then asks them to give their votes by writing the request to its node's data content. After a participant decides to commit or abort, it sends its vote to the coordinator by writing it to its respective node. The coordinator watches for any changes of its children in a loop and waits until either one of the participants votes abort or all sites have voted commit. Then it writes the final result to the transaction node. After receiving the result, the sites send their acknowledgements back to the coordinator in the same way as they provided their votes. After the coordinator collects all the acknowledgements, the transaction is done.

5.2.2 JBoss Infinispan

Two replicated Infinispan caches are used: the coordinator cache, storing either the coordinator request or the decision and its value, and the sites cache, where the participants' addresses are mapped to string values. First the sites are started. After a site starts, it opens the coordinator and sites caches and sets a listener on changes of the coordinator cache. The coordinator is started no sooner than all sites which want to participate have opened their sessions. The coordinator collects the addresses of the remote cache managers using the sites cache and puts them one by one into the sites cache, mapped to an empty string. Then it requests the sites to give their votes by writing the transaction request to the coordinator cache under the key request, and sets a listener on the sites cache. Every site now finds its respective address in the sites cache key set and writes its transaction decision as the value under its address. The coordinator listener is triggered each time a site gives its vote and waits until all sites have provided their votes.

Then it decides the final result and propagates it to the sites by writing it to the coordinator cache under the key decision. After all the sites have acknowledged receiving the result, in the same way as they voted, the transaction is done.

5.2.3 Akka

The implementation uses two functionalities provided by Akka, remoting and routing. First all the participants are started. They create their actor systems and then wait for the request from the coordinator. The coordinator, after setting up its own actor system, creates the participant nodes for all sites participating in the transaction. Here the remoting is used: all creations are redirected, according to the configuration, to the specific remote actor systems. The coordinator then creates its own actor and in the preStart() method requests the participants for their votes. This is done using the routing functionality. A router actor of the type BroadcastGroup is created in the coordinator's actor system, which redirects all messages to the other actors specified in a list assigned at its creation; this list contains the paths to the participant nodes. The participants and the coordinator then communicate in a request-response fashion using the method onReceive(Object message), from which getSender() is called to track the sender address and send the response. After the coordinator collects the messages with the decisions from the participating sites, it decides the result, sends it back to all sites and waits for their acknowledgements.

5.2.4 Netty

When implementing a new protocol in Netty it is advisable to start from one of the examples in the Netty repository; in this implementation the Telnet example is used as the starting point. There is one server representing the coordinator and several clients for the participants, which connect to the server on startup. When a connection to the server is initiated, the server's method channelActive(ChannelHandlerContext ctx) is triggered. From the ChannelHandlerContext the channel of the client is acquired and put into a variable of the type ChannelGroup collecting all the connected clients. When the size of the group reaches the desired number of participants, the coordinator sends the vote request to each participant. The communication then continues in the form of request-response, using the same principle as in Akka.
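A rough sketch of such a server-side handler (the class name, participant count and vote-request message are illustrative, not the thesis code; a StringDecoder/StringEncoder pair is assumed in the pipeline, as in the Telnet example):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.group.ChannelGroup;
    import io.netty.channel.group.DefaultChannelGroup;
    import io.netty.util.concurrent.GlobalEventExecutor;

    public class CoordinatorHandler extends ChannelInboundHandlerAdapter {

        private static final int EXPECTED_PARTICIPANTS = 3;   // illustrative
        private static final ChannelGroup participants =
                new DefaultChannelGroup(GlobalEventExecutor.INSTANCE);

        @Override
        public void channelActive(ChannelHandlerContext ctx) {
            // Remember the newly connected participant.
            participants.add(ctx.channel());
            // Once everyone has joined, broadcast the vote request.
            if (participants.size() == EXPECTED_PARTICIPANTS) {
                participants.writeAndFlush("VOTE_REQUEST\r\n");
            }
        }

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            // Votes arrive here; a real coordinator would collect and count them.
            System.out.println("from " + ctx.channel().remoteAddress() + ": " + msg);
        }
    }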

5.3 Three phase commit

The three phase commit protocol works in the very same way as the two phase commit protocol, with the only difference that it adds a second commit phase. The transition between the implementations is therefore straightforward. Instead of the commit request sent by the coordinator, two types of requests are used: preCommit and doCommit. To a preCommit the participant responds with an ACK. After having collected the acknowledgements from all the participants, the coordinator sends the final result doCommit. The participant then enters the committed state and sends the second acknowledgement, haveCommitted. The transaction is finished after the coordinator collects these acknowledgements from all participants. However, the power of the three phase commit does not show until timeouts are implemented. Real world implementations are able to manage response expirations to handle crashes of either the coordinator or the participants. In this work only the basic functionality of the protocol is implemented, as timeout management would make for another chapter.

5.4 Leader election

The aim is to solve the problem of selecting one node from a group of nodes, informing the other nodes about which node has been chosen, and repeating the process when the selected node fails.

5.4.1 Apache Zookeeper

Leader election in Apache Zookeeper can be implemented in a very simple manner. Suppose a ZNode called /election is created, under which each election candidate appends a new node with the EPHEMERAL and SEQUENTIAL flags. The name of the node is automatically appended with an index greater than that of the sibling node with the largest index existing before it was created. The process that created the node with the smallest index is the leader. When the leader process terminates, its respective node is deleted automatically because it was created as an EPHEMERAL node. This is the event the other nodes must be watching for. To avoid an unnecessary herd effect, only one node is notified, namely the node with the next smallest index after the node that terminated. This process continues until there are no more nodes under /election (a minimal code sketch of this recipe follows after the Netty variant below).

5.4.2 Netty

Here the same architecture is used as in the two phase commit, with one server as the coordinator and several clients representing the election candidates. When a client joins the coordinator, its respective Channel is added to the pool collecting all connected clients. After the given number of candidates have connected to the server, the method electLeader() is triggered on the coordinator, which determines the leader and then sends the election result to all the participants. The election method is based on choosing the client communicating on the channel with the minimal id. When a member gets the message that it has been selected as the leader, it starts the leader procedure and terminates after it has finished. This event is watched by a listener implementing ChannelFutureListener located in the coordinator. The leader is removed from the pool of candidates and a new leader is elected. This repeats until the pool is empty.
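As mentioned in 5.4.1, the ZooKeeper election recipe essentially boils down to the following sketch (paths and names are illustrative; error handling and re-registration of the predecessor watch are omitted):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class ElectionSketch {

        // Joins the election under an existing /election node and
        // returns true if this candidate is currently the leader.
        public static boolean joinElection(ZooKeeper zk) throws Exception {
            String myPath = zk.create("/election/candidate-", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String myName = myPath.substring("/election/".length());

            // The candidate whose node has the smallest sequence number leads.
            List<String> children = zk.getChildren("/election", false);
            Collections.sort(children);
            boolean leader = children.get(0).equals(myName);

            if (!leader) {
                // Watch only the predecessor node to avoid the herd effect.
                int myIndex = children.indexOf(myName);
                String predecessor = "/election/" + children.get(myIndex - 1);
                zk.exists(predecessor, event -> System.out.println("predecessor gone: " + event));
            }
            return leader;
        }
    }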

5.4.3 JBoss Infinispan

In this implementation two caches are used: electableMembersCache and leaderCache. ElectableMembersCache persists keys of the Long type, denoting the time when the given member applied for the leadership, mapped to the addresses of the nodes. The addresses are acquired in the same fashion as in the two phase commit; they represent the address of the remote EmbeddedCacheManager of the respective node. There is one public method, becomeElectable(), called on startup for joining the electable members. First the pair of the current time and the address of the node entering the group of electables is put into the electableMembersCache. Subsequently, the minimal value from the key set of electableMembersCache is acquired to determine the node that joined before any other node. This minimal index is compared to the key under which the current node has been saved, and if they are equal the node becomes the leader and performs the leader procedure. Otherwise it sets a listener on the EmbeddedCacheManager for watching changes of the node topology. The event triggering the listener holds two values: the set of the old members before the listener was notified and the set of the new members after the topology change. To get the members that disconnected in the current change, the new members need to be subtracted from the old ones. For every node that has disconnected, its respective key-value pair then needs to be deleted from the electableMembersCache. At the end of the listener the new leader election is performed, in the sense that the member with the currently minimal index becomes the leader. This repeats until the electableMembersCache is empty.

5.4.4 Akka

Leader election in Akka uses the Akka Cluster module, which needs to be added as a project dependency. The actor representing the electable member listens in the method onReceive(Object message) for cluster changes by waiting for three types of messages: CurrentClusterState, sent to the subscriber when it initiates the session, MemberUp, when a new member joins the group of electables, and MemberRemoved, when a member disconnects.

After receiving the CurrentClusterState, the members from the current state are added to a sorted set using an age comparator. Every member of the type Member has the built-in function isOlderThan(Member m), which can be utilized in the comparator. The method currentMaster() then returns the first member from the set of members, which is the leader. This implementation has been adopted from the Akka samples because it perfectly demonstrates the built-in infrastructure for membership handling.

5.5 Distributed locking

The locking part of the algorithm can be achieved in the very same way as the process of leader election; the differences appear only in the behaviour after the member has been selected. The method for becoming electable simply translates to the method for acquiring the lock. Instead of performing the leader procedure, the selected member informs all the other members that it has acquired the lock. The member can then perform a method requiring exclusive access to some resources. After it is done, it calls the unlock method, in which it informs the other members that it has released the resources, and the node can continue to work without terminating. The unlock method can be implemented in various ways depending on the technology. In JBoss Infinispan a new type of listener is used: instead of listening for changes of the node topology, it listens for the removal of the cache entry with the reference to the node holding the lock. In Zookeeper the members simply watch for the removal of the ZNode belonging to the process holding the lock. In Netty, instead of closing the session after finishing the leader procedure, a message is sent back to the coordinator, which consequently deletes the given member from the watched channels and performs a new election without considering it.

5.6 JPaxos

The JPaxos framework was eventually excluded from the group of technologies used in the implementation part, for the following reasons. State machine replication is applicable in use cases where one server instance is to be replicated over a set of machines. However, in the case of these algorithms, the separate servers need to keep their own independent state in order to make individual decisions on the transaction result in the 2PC and 3PC protocols. This turns out to be impossible in JPaxos, because the replica and the service are bundled together and only then replicated. In leader election, a special flag designating the leader would again need to be applied to only one of the servers. The next reason is the fact that the technology is slowly becoming outdated, since it has not been updated for almost two and a half years. The documentation is currently largely unfinished and there are issues in the library still waiting to be resolved.

5.7 LOC and cyclomatic complexity analysis

This part compares the implementations in terms of two metrics, lines of code and cyclomatic complexity. The results are organised in a table.

Figure 5.1: Two phase commit

Figure 5.2: Three phase commit

Figure 5.3: Leader election

Figure 5.4: Distributed locking

6 Deployment on a cluster and test results

This part is devoted to specifying the configuration needed when deploying the given technologies on a cluster. The user experience of working with a technology for distributed processing is largely affected by how easy it is to deploy on the cluster. There are several ways of looking up the cluster members over the network; each time it is a combination of a host, in the form of an IP address, and a port the node is listening on. However, not all the technologies use the same concept. Some need to be configured only on the client side, and the server then responds back by inferring the client address automatically. Others use several servers, which need to know about each other, and when the configuration changes, each server must be reconfigured manually. The last method uses a multicast host, which is a very flexible approach and allows the node structure to be changed at any time.

6.1 Deployment details

6.1.1 Apache Zookeeper

The cluster formed by remote servers in Zookeeper is called a Zookeeper ensemble. Each machine in the ensemble must be configured to know about every other machine. The configuration of a Zookeeper server consists of the following steps. First it is necessary to install a Java JDK and set an appropriate Java heap size so that the Zookeeper server is able to run without unnecessary swapping. Then follows the installation of the ZooKeeper server package by downloading and unpacking it into a specified directory. The last thing required is creating a configuration file located in the server directory and setting at least these parameters: the path to the data directory for saving snapshots of the system and the file with the id of the server, the client port for the clients that want to connect to the server, and the list of the servers the ensemble consists of, in the form server.id=host:port:port. The id is a unique identifier of the server and there is a corresponding file in the data directory with the name of the id. Host is the address of the given
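A minimal configuration file of this kind might look as follows (a sketch with illustrative host names, ports and paths, using the standard zoo.cfg keys):

    # zoo.cfg - illustrative three-server ensemble
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.id=host:quorumPort:electionPort
    server.1=node1.example.org:2888:3888
    server.2=node2.example.org:2888:3888
    server.3=node3.example.org:2888:3888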


More information

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS 2013 Long Kai EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS BY LONG KAI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate

More information

Failures, Elections, and Raft

Failures, Elections, and Raft Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3

More information

FAULT TOLERANT LEADER ELECTION IN DISTRIBUTED SYSTEMS

FAULT TOLERANT LEADER ELECTION IN DISTRIBUTED SYSTEMS FAULT TOLERANT LEADER ELECTION IN DISTRIBUTED SYSTEMS Marius Rafailescu The Faculty of Automatic Control and Computers, POLITEHNICA University, Bucharest ABSTRACT There are many distributed systems which

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions Transactions Main issues: Concurrency control Recovery from failures 2 Distributed Transactions

More information

Dynamic Reconfiguration of Primary/Backup Clusters

Dynamic Reconfiguration of Primary/Backup Clusters Dynamic Reconfiguration of Primary/Backup Clusters (Apache ZooKeeper) Alex Shraer Yahoo! Research In collaboration with: Benjamin Reed Dahlia Malkhi Flavio Junqueira Yahoo! Research Microsoft Research

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

CSE 5306 Distributed Systems. Consistency and Replication

CSE 5306 Distributed Systems. Consistency and Replication CSE 5306 Distributed Systems Consistency and Replication 1 Reasons for Replication Data are replicated for the reliability of the system Servers are replicated for performance Scaling in numbers Scaling

More information

SimpleChubby: a simple distributed lock service

SimpleChubby: a simple distributed lock service SimpleChubby: a simple distributed lock service Jing Pu, Mingyu Gao, Hang Qu 1 Introduction We implement a distributed lock service called SimpleChubby similar to the original Google Chubby lock service[1].

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

This tutorial will give you a quick start with Consul and make you comfortable with its various components.

This tutorial will give you a quick start with Consul and make you comfortable with its various components. About the Tutorial Consul is an important service discovery tool in the world of Devops. This tutorial covers in-depth working knowledge of Consul, its setup and deployment. This tutorial aims to help

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju Chapter 4: Distributed Systems: Replication and Consistency Fall 2013 Jussi Kangasharju Chapter Outline n Replication n Consistency models n Distribution protocols n Consistency protocols 2 Data Replication

More information

ZooKeeper. Wait-free coordination for Internet-scale systems

ZooKeeper. Wait-free coordination for Internet-scale systems ZooKeeper Wait-free coordination for Internet-scale systems Patrick Hunt and Mahadev (Yahoo! Grid) Flavio Junqueira and Benjamin Reed (Yahoo! Research) Internet-scale Challenges Lots of servers, users,

More information

Distributed Systems (5DV147)

Distributed Systems (5DV147) Distributed Systems (5DV147) Replication and consistency Fall 2013 1 Replication 2 What is replication? Introduction Make different copies of data ensuring that all copies are identical Immutable data

More information

Apache Zookeeper. h,p://zookeeper.apache.org

Apache Zookeeper. h,p://zookeeper.apache.org Apache Zookeeper h,p://zookeeper.apache.org What is a Distributed System? A distributed system consists of mulaple computers that communicate through a computer network and interact with each other to

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Consistency and Replication Jia Rao http://ranger.uta.edu/~jrao/ 1 Reasons for Replication Data is replicated for the reliability of the system Servers are replicated for performance

More information

F5 BIG-IQ Centralized Management: Local Traffic & Network. Version 5.2

F5 BIG-IQ Centralized Management: Local Traffic & Network. Version 5.2 F5 BIG-IQ Centralized Management: Local Traffic & Network Version 5.2 Table of Contents Table of Contents BIG-IQ Local Traffic & Network: Overview... 5 What is Local Traffic & Network?... 5 Understanding

More information

Batches and Commands. Overview CHAPTER

Batches and Commands. Overview CHAPTER CHAPTER 4 This chapter provides an overview of batches and the commands contained in the batch. This chapter has the following sections: Overview, page 4-1 Batch Rules, page 4-2 Identifying a Batch, page

More information

A Reliable Broadcast System

A Reliable Broadcast System A Reliable Broadcast System Yuchen Dai, Xiayi Huang, Diansan Zhou Department of Computer Sciences and Engineering Santa Clara University December 10 2013 Table of Contents 2 Introduction......3 2.1 Objective...3

More information

Modern Database Concepts

Modern Database Concepts Modern Database Concepts Basic Principles Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz NoSQL Overview Main objective: to implement a distributed state Different objects stored on different

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos Consensus I Recall our 2PC commit problem FLP Impossibility, Paxos Client C 1 C à TC: go! COS 418: Distributed Systems Lecture 7 Michael Freedman Bank A B 2 TC à A, B: prepare! 3 A, B à P: yes or no 4

More information

BookKeeper overview. Table of contents

BookKeeper overview. Table of contents by Table of contents 1...2 1.1 BookKeeper introduction...2 1.2 In slightly more detail...2 1.3 Bookkeeper elements and concepts... 3 1.4 Bookkeeper initial design... 3 1.5 Bookkeeper metadata management...

More information

Distributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang

Distributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang A lightweight, pluggable, and scalable ingestion service for real-time data ABSTRACT This paper provides the motivation, implementation details, and evaluation of a lightweight distributed extract-transform-load

More information

Consistency in Distributed Systems

Consistency in Distributed Systems Consistency in Distributed Systems Recall the fundamental DS properties DS may be large in scale and widely distributed 1. concurrent execution of components 2. independent failure modes 3. transmission

More information

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Topics in Reliable Distributed Systems

Topics in Reliable Distributed Systems Topics in Reliable Distributed Systems 049017 1 T R A N S A C T I O N S Y S T E M S What is A Database? Organized collection of data typically persistent organization models: relational, object-based,

More information

Bull. HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02

Bull. HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02 Bull HACMP 4.4 Programming Locking Applications AIX ORDER REFERENCE 86 A2 59KX 02 Bull HACMP 4.4 Programming Locking Applications AIX Software August 2000 BULL CEDOC 357 AVENUE PATTON B.P.20845 49008

More information

Distributed Consensus Protocols

Distributed Consensus Protocols Distributed Consensus Protocols ABSTRACT In this paper, I compare Paxos, the most popular and influential of distributed consensus protocols, and Raft, a fairly new protocol that is considered to be a

More information

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 14: Data Replication Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Database Replication What is database replication The advantages of

More information

Time and Space. Indirect communication. Time and space uncoupling. indirect communication

Time and Space. Indirect communication. Time and space uncoupling. indirect communication Time and Space Indirect communication Johan Montelius In direct communication sender and receivers exist in the same time and know of each other. KTH In indirect communication we relax these requirements.

More information

Broker Clusters. Cluster Models

Broker Clusters. Cluster Models 4 CHAPTER 4 Broker Clusters Cluster Models Message Queue supports the use of broker clusters: groups of brokers working together to provide message delivery services to clients. Clusters enable a Message

More information

RFC 003 Event Service October Computer Science Department October 2001 Request for Comments: 0003 Obsoletes: none.

RFC 003 Event Service October Computer Science Department October 2001 Request for Comments: 0003 Obsoletes: none. Ubiquitous Computing Bhaskar Borthakur University of Illinois at Urbana-Champaign Software Research Group Computer Science Department October 2001 Request for Comments: 0003 Obsoletes: none The Event Service

More information

Coordination and Agreement

Coordination and Agreement Coordination and Agreement Nicola Dragoni Embedded Systems Engineering DTU Informatics 1. Introduction 2. Distributed Mutual Exclusion 3. Elections 4. Multicast Communication 5. Consensus and related problems

More information

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication CSE 444: Database Internals Section 9: 2-Phase Commit and Replication 1 Today 2-Phase Commit Replication 2 Two-Phase Commit Protocol (2PC) One coordinator and many subordinates Phase 1: Prepare Phase 2:

More information

CS October 2017

CS October 2017 Atomic Transactions Transaction An operation composed of a number of discrete steps. Distributed Systems 11. Distributed Commit Protocols All the steps must be completed for the transaction to be committed.

More information

Master s Thesis. A Construction Method of an Overlay Network for Scalable P2P Video Conferencing Systems

Master s Thesis. A Construction Method of an Overlay Network for Scalable P2P Video Conferencing Systems Master s Thesis Title A Construction Method of an Overlay Network for Scalable P2P Video Conferencing Systems Supervisor Professor Masayuki Murata Author Hideto Horiuchi February 14th, 2007 Department

More information

Assignment 12: Commit Protocols and Replication Solution

Assignment 12: Commit Protocols and Replication Solution Data Modelling and Databases Exercise dates: May 24 / May 25, 2018 Ce Zhang, Gustavo Alonso Last update: June 04, 2018 Spring Semester 2018 Head TA: Ingo Müller Assignment 12: Commit Protocols and Replication

More information

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Building Agile and Resilient Schema Transformations using Apache Kafka and ESB's Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Ricardo Ferreira

More information

Synchronization. Chapter 5

Synchronization. Chapter 5 Synchronization Chapter 5 Clock Synchronization In a centralized system time is unambiguous. (each computer has its own clock) In a distributed system achieving agreement on time is not trivial. (it is

More information

CS505: Distributed Systems

CS505: Distributed Systems Cristina Nita-Rotaru CS505: Distributed Systems Protocols. Slides prepared based on material by Prof. Ken Birman at Cornell University, available at http://www.cs.cornell.edu/ken/book/ Required reading

More information

Indirect Communication

Indirect Communication Indirect Communication Vladimir Vlassov and Johan Montelius KTH ROYAL INSTITUTE OF TECHNOLOGY Time and Space In direct communication sender and receivers exist in the same time and know of each other.

More information

416 practice questions (PQs)

416 practice questions (PQs) 416 practice questions (PQs) 1. Goal: give you some material to study for the final exam and to help you to more actively engage with the material we cover in class. 2. Format: questions that are in scope

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Copyright 2003 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4. Other Approaches

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

Implementation and Performance of a SDN Cluster- Controller Based on the OpenDayLight Framework

Implementation and Performance of a SDN Cluster- Controller Based on the OpenDayLight Framework POLITECNICO DI MILANO Dipartimento di Elettronica, Informazione e Bioingegneria Master of Science in Telecommunication Engineering Implementation and Performance of a SDN Cluster- Controller Based on the

More information

Trinity File System (TFS) Specification V0.8

Trinity File System (TFS) Specification V0.8 Trinity File System (TFS) Specification V0.8 Jiaran Zhang (v-jiarzh@microsoft.com), Bin Shao (binshao@microsoft.com) 1. Introduction Trinity File System (TFS) is a distributed file system designed to run

More information

<Insert Picture Here> QCon: London 2009 Data Grid Design Patterns

<Insert Picture Here> QCon: London 2009 Data Grid Design Patterns QCon: London 2009 Data Grid Design Patterns Brian Oliver Global Solutions Architect brian.oliver@oracle.com Oracle Coherence Oracle Fusion Middleware Product Management Agenda Traditional

More information

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 6-A-1 CS535 BIG DATA FAQs PA1: Use only one word query Deadends {{Dead end}} Hub value will be?? PART 1. BATCH COMPUTING MODEL FOR BIG DATA ANALYTICS 4. GOOGLE FILE SYSTEM

More information

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures GFS Overview Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures Interface: non-posix New op: record appends (atomicity matters,

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

Indirect Communication

Indirect Communication Indirect Communication Today l Space and time (un)coupling l Group communication, pub/sub, message queues and shared memory Next time l Distributed file systems xkdc Indirect communication " Indirect communication

More information

Consistency and Replication 1/65

Consistency and Replication 1/65 Consistency and Replication 1/65 Replicas and Consistency??? Tatiana Maslany in the show Orphan Black: The story of a group of clones that discover each other and the secret organization Dyad, which was

More information

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems Fall 2017 Exam 3 Review Paul Krzyzanowski Rutgers University Fall 2017 December 11, 2017 CS 417 2017 Paul Krzyzanowski 1 Question 1 The core task of the user s map function within a

More information

Basic vs. Reliable Multicast

Basic vs. Reliable Multicast Basic vs. Reliable Multicast Basic multicast does not consider process crashes. Reliable multicast does. So far, we considered the basic versions of ordered multicasts. What about the reliable versions?

More information