Towards a Practical Survivable Intrusion Tolerant Replication System


Marco Platania, Daniel Obenshain, Thomas Tantillo, Ricky Sharma, Yair Amir
Department of Computer Science at Johns Hopkins University
{platania, dano, tantillo, rsharm22, yairamir}@cs.jhu.edu
Technical Report CNDS, April

Abstract: The increasing number of cyber attacks against critical infrastructures, which typically require large state and long system lifetimes, necessitates the design of systems that are able to work correctly even if part of them is compromised. We present the first practical survivable intrusion tolerant replication system, which defends across space and time using compiler-based diversity and proactive recovery, respectively. Our system supports large-state applications, and utilizes the Prime BFT protocol (providing performance guarantees under attack) with a compiler-based diversification engine. We devise a novel theoretical model that computes how resilient the system is over its lifetime based on the rejuvenation rate and the number of replicas. This model shows that we can achieve a confidence in the system of 95% over 30 years even when we transfer a state of 1 terabyte after each rejuvenation.

I. INTRODUCTION

Critical infrastructures such as financial, transport, or SCADA systems play an important role in everyday life. In this world, availability, reliability, and security are paramount. However, it is well known that exploits exist in software architectures and that attackers use these exploits to compromise systems. As a consequence, critical systems must be built to tolerate intrusions: they need to support consistent, large state across the infrastructure, employing algorithms that guarantee correct distributed operation over long system lifetimes even in the face of intrusions, where part of the infrastructure is controlled by the adversary.

To this end, Byzantine Fault Tolerant (BFT) protocols (e.g. []) can be considered a building block for the design of intrusion tolerant systems, because they are able to work correctly even if f out of 3f + 1 replicas behave in an arbitrary manner. However, BFT protocols alone do not provide a solid basis for the construction of intrusion tolerant replication systems with a long lifetime (e.g. years): if a smart attacker is able to compromise f replicas, it is likely that with time he will be able to compromise the (f + 1)th replica, enabling the attacker to cause inconsistencies. This is particularly true if all the replicas are identical, as is common in real-world deployments, because an identical attack surface allows an attacker to successfully compromise all the replicas in the system with the same attack. As such, it is important to diversify replicas as much as possible [2]-[4]. We refer to this approach as defending the system across space.

However, defense across space only increases the attacker's workload linearly, by requiring the attacker to develop f + 1 distinct attacks. The attacker can still spend the time to develop these attacks, especially if the system is long-lived. Hence, periodic rejuvenation of replicas is necessary to clean potentially undetected intrusions. The rejuvenated replica should restart as non-compromised to ensure any intrusion is cleaned, and should be diverse from all currently and previously existing replicas to ensure that the attacker has no advantage from prior knowledge. This forces the attacker to act quickly, otherwise any progress he or she has made in compromising system components will be undone. We refer to this approach as defending the system across time.
Some of the current approaches in the field of intrusion tolerant systems [], [5], [6] offer weak solutions to defend systems across space and time: they either do not diversify replicas after rejuvenation, or they use coarse-grained diversity (e.g. at the operating system level), where the strength of such a system is limited by the small number of different copies available. Practical deployments of proactive rejuvenation raise the issue of how often to refresh replicas in order to maintain system correctness over its lifetime. Rejuvenating too often can reduce system availability, limit the size of the replicated state, and incur unnecessary overhead. In contrast, rejuvenating too infrequently can increase the likelihood that the attacker will succeed.

In this paper, we present the first practical survivable intrusion tolerant replication system that provides defense across space and time, a term first used in [7]. Our main contributions are:

- The first proactive recovery protocol that supports large state. We devise two novel state transfer strategies, one that prioritizes fast data retrieval and one that minimizes bandwidth usage, which can be used to restart a replica from a clean state;
- A theoretical model that computes the resiliency of the system over its lifetime (e.g. 30 years) based on the rejuvenation rate, the number of replicas, and the strength of a single replica; and
- The first integration of subsystems that support the assumptions of a practical survivable data replication system: the Prime BFT protocol [8], which ensures performance guarantees even while under attack, and which recently has been integrated into the Siemens corporation's commercial SCADA product for the power grid [9]; and the MultiCompiler [10], which produces different versions of the system such that no two versions present an identical attack surface.

This does not mean that the exploits have vanished, but rather that the attacker must craft a new attack for each replica. A new version of a replica is produced after its rejuvenation. In this way, the system is protected unless the attacker is able to compromise more than f components in a limited amount of time (the time for the entire system to be rejuvenated). This time is chosen so that the probability that the attacker will succeed is low.

In a previous work [4], Schneider et al. presented a framework that periodically rejuvenates replicas, using built-in operating system tools to generate diverse copies. In our work, we generate diverse versions of Prime replicas by augmenting the built-in operating system tools with a compiler-based diversification engine. A theoretical model for proactive rejuvenations has also been presented in []. While this model applies to stateless systems, in our work we define the rejuvenation rate taking into account the size of the state that may need to be transferred and the time to transfer it.

It is worthwhile to note that proactive recovery and diversity do not protect against protocol flaws. If the protocol has a flaw that makes it vulnerable to attacks (e.g. SQL injection), then proactive recovery and diversity have no effect. In addition, proactive recovery and diversity do not protect against DDoS attacks. Handling resource exhaustion attacks is an orthogonal and complementary problem to the one described in this paper.

The remainder of the paper is organized as follows: Section II introduces Prime and software diversity. Section III presents the system model. Section IV describes the proactive recovery algorithm. Section V presents the experimental evaluation of the proposed approach. Section VI illustrates the theoretical rejuvenation model. Section VII discusses previous works, and Section VIII concludes the paper.

II. BACKGROUND

A. Prime, a BFT protocol with performance guarantees

In this paper we focus on Prime as the baseline to build a long-lived intrusion tolerant system. Unlike previous BFT protocols, Prime bounds the amount of performance degradation that can be caused by a malicious leader. To do so, Prime extends the typical pre-prepare, prepare, and commit phases for global ordering with a pre-ordering phase, where a correct replica that receives a client operation broadcasts that operation in the system together with a locally generated sequence number (see Figure 1). If the operations injected by the leader during the pre-prepare phase do not match those in the pre-ordering phase, the leader is suspected by correct replicas and replaced.

[Fig. 1: The pre-ordering phase in Prime helps to detect a delay attack and replace a malicious leader.]

In addition, Prime replicas run a background protocol to estimate the round-trip time to each other, in order to compute how fast the leader should inject client operations for global ordering.
If the leader fails to respect timeliness constraints, it is suspected by correct replicas and replaced. Compared with other BFT protocols, the pre-ordering phase and the round-trip time estimation in Prime introduce a small performance hit during normal operations. However, these are necessary to monitor the behavior of the leader and replace it quickly, guaranteeing that even under attack the client operations introduced by correct replicas are executed within a bounded delay that depends on the current network conditions (see Section III). This allows Prime to perform an order of magnitude better than previous BFT protocols in the presence of an attack.

B. Software diversity

Software diversity, such as N-version programming [2], [3], was originally introduced for software reliability. More recently, cheaper software diversity solutions [2], [3] (i.e. not involving humans in the loop) have been used to defend software systems from code reuse attacks, in which an attacker exploits knowledge of the code by using snippets or entire functions from the application itself to perform an attack. The goal of software diversity is to evolve programs into different but semantically identical versions, such that it is unlikely that the same attack will succeed on any two variants [4].

In this paper we use the compiler presented in [10] to diversify Prime replicas. This MultiCompiler is based on the LLVM 3.1 compiler and makes use of no-operation insertion, stack padding, shuffling the stack frames, substituting instructions for equivalent instructions, randomizing the register allocation, and randomizing instruction scheduling to obfuscate the code layout of an application. After each rejuvenation, the MultiCompiler takes the Prime bitcode, i.e. a compiler-generated intermediate representation of the source code, and a 64-bit random seed, and generates a diverse copy of a Prime replica from a large entropy space. Hence, if an adversary attacks all replicas in parallel, the probability of defeating more than f replicas is low. Diversity obtainable with the MultiCompiler complements diversity obtainable at the operating system level, e.g. using different distributions or different versions of the same distribution.
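To make the diversification step concrete, the following is a minimal sketch of a rejuvenation-time rebuild in Python. The multicompiler-clang command name and its seed flag are illustrative placeholders rather than the actual MultiCompiler interface, and os.urandom stands in for the TPM's random number generator.

```python
import os
import subprocess

def rebuild_diverse_replica(bitcode="prime.bc", output="prime"):
    """Recompile the Prime bitcode into a freshly diversified binary.

    In the real system the bitcode and the MultiCompiler are read from
    a read-only medium and the seed comes from the TPM; here the
    compiler command and flag are hypothetical stand-ins.
    """
    # 64-bit random seed, as described in Section II-B.
    seed = int.from_bytes(os.urandom(8), "big")

    # Hypothetical MultiCompiler invocation: compile the LLVM bitcode
    # while randomizing code layout (NOP insertion, stack padding,
    # register allocation, instruction scheduling) from the seed.
    subprocess.run(
        ["multicompiler-clang", bitcode, "-o", output,
         "-frandom-seed={}".format(seed)],
        check=True,
    )
    return seed  # logged so an administrator can reproduce the build
```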

III. SYSTEM MODEL AND PROPERTIES

The system is composed of n replicas, which run the Prime BFT protocol and communicate by exchanging messages. Replicas may suffer from Byzantine or benign faults. We characterize replicas based on their behavior:

- Correct replica: a replica that follows the algorithm, is consistent, and is not being partitioned or rejuvenated.
- Malicious (or compromised) replica: a replica that exhibits arbitrary (i.e. Byzantine) behavior not according to the algorithm.
- Crashed replica: a replica that stops working. A crashed replica can restart the application from the state on the disk.
- Rejuvenating replica: a replica that is experiencing a benign fault. In particular, even if the replica was malicious, during rejuvenation it switches to a crashed replica.
- Partitioned replica: a replica that cannot communicate with at least γ correct replicas (see Sections III-B and III-C).

Replicas are periodically rejuvenated to clean any intrusions. A fundamental condition for success is that a correct replica completes recovery before the rejuvenation of the next replica. We define the time between two consecutive rejuvenations of any replicas as the inter-rejuvenation period. A rejuvenation cycle is n times the inter-rejuvenation period. In addition, we require that client operations, also referred to as updates, are not injected into the system faster than correct replicas can execute them. In this way, we ensure that a correct replica that rejuvenates eventually catches up.

All messages sent among replicas are digitally signed. We assume that digital signatures are unforgeable without knowing a replica's private key. We also make use of a collision-resistant hash function for computing message digests. In addition, we require that each replica is equipped with tamper-proof cryptographic material to generate and store the replica's private key and sign messages without revealing that key. To this end, we use the Trusted Platform Module (TPM). The TPM also comes with a random number generator and a monotonically increasing counter. Many modern computers are sold with a TPM module built-in.

In the following, we first specify the attack model, and then we describe the system model in the presence of 3f + 1 replicas, as in typical BFT protocols, and in the presence of 3f + 2k + 1 replicas [5], where k is the maximum number of crashed, rejuvenating, and partitioned replicas tolerated in the presence of f malicious replicas. In the presence of 3f + 2k + 1 replicas the algorithm is guaranteed to make progress despite malicious/benign failures and the rejuvenation of correct replicas.

A. Attack model

We allow for a very powerful adversary that can exploit vulnerabilities of the operating system and the application to compromise and control a replica. The adversary can delay the sending and receipt of messages of malicious replicas, but it cannot affect communication among correct replicas. In addition, the adversary can leak the private key of a malicious replica and disseminate that key to other malicious replicas in order to send forged messages. The adversary can also compromise the state of a malicious replica. However, we assume that the adversary has no physical access to the system and is computationally bounded, such that it cannot subvert the cryptographic mechanisms described above.
B. n = 3f + 1 replicas

We consider f to be the maximum number of replicas that can concurrently experience malicious and/or benign faults (including recovering, crashed, and partitioned replicas) within a vulnerability window, i.e. the maximum time T between when a replica fails and when it recovers from that fault []. We also assume that the network may experience temporary partitions. If the number of faulty plus partitioned replicas exceeds f, then the system halts until the partitioned replicas reconnect and catch up. In the presence of 3f + 1 replicas we define a partitioned replica as a replica that cannot communicate with another γ = 2f correct replicas.

In the following, we discuss how the Prime properties presented in [8] change with proactive recovery.

Property 1: SAFETY: If two correct replicas execute the i-th update, then these updates are identical.

To guarantee SAFETY in the presence of proactive recovery, a correct replica has to implement a persistent memory that survives across rejuvenations, whose content has to be validated after each rejuvenation. We require that each correct Prime replica stores on the disk all messages that it sends to or accepts from other replicas. This prevents a correct replica from sending the same message twice or accepting two different messages with the same sequence number. After rejuvenation, the replica can reload the messages into main memory.

Property 2: NETWORK-STABILITY: There is a time after which the following condition holds for a set S of at least 2f + 1 correct replicas (stable replicas): for each pair of replicas r and s, there exists a value Min_Lat(r, s), unknown to the replicas, such that if r sends a message to s, it will arrive with a delay Δ(r, s), where Min_Lat(r, s) ≤ Δ(r, s) ≤ Min_Lat(r, s) × K_Lat, with K_Lat a known network-specific constant accounting for latency variability.

In those executions in which NETWORK-STABILITY is met, Prime guarantees the following LIVENESS property.

Property 3: LIVENESS: If a stable replica initiates an update, all stable replicas eventually execute the update.

This property guarantees that if the network is sufficiently stable, each update is eventually executed by all correct replicas. To meet LIVENESS we must guarantee that recovery eventually completes and partitions eventually heal. Because we require that updates are not injected into the system faster than correct replicas can execute them, we guarantee that eventually a recovering replica will catch up.

The LIVENESS property does not specify how fast the updates need to be executed. When NETWORK-STABILITY is met, Prime provides a stronger performance guarantee:

Property 4: BOUNDED-DELAY: There exists a time after which the latency for any update initiated by a stable replica is upper bounded.

In the presence of proactive recovery even a correct and fast leader will eventually be rejuvenated. In this case we can still guarantee BOUNDED-DELAY, but for at most 3f occurrences the worst-case bound on BOUNDED-DELAY becomes t = 2f·α + β, with α the time to detect and replace a malicious or slow leader, and β the time for a correct replica to complete recovery. We calculate that t will be on the order of hours for real systems with large state, with t dominated by β. Under normal operations (i.e. no recovery in progress), if in the presence of f failures an additional replica partitions away for a while and then rejoins, we are not able to guarantee BOUNDED-DELAY until the partitioned replica catches up. In this case we are only able to guarantee LIVENESS.

C. n = 3f + 2k + 1 replicas

In this section we investigate how to reduce t, the worst-case bound on BOUNDED-DELAY after the rejuvenation of the leader. We avoid waiting for the complete recovery after the rejuvenation of a correct replica by adding other replicas to the system. Specifically, we need 3f + 2k + 1 replicas, as previously described in [5]. k is the maximum number of crashed, recovering, and partitioned replicas tolerated in the presence of f malicious replicas during a vulnerability window T. In the presence of 3f + 2k + 1 replicas we define a partitioned replica as a replica that cannot communicate with another γ = 2f + k correct replicas.

Augmenting the number of replicas requires at least 2f + k + 1 replicas to order updates. Hence, certificates collected during the pre-ordering and ordering phases are composed of 2f + k + 1 messages. The definitions of SAFETY and LIVENESS do not change in the presence of 3f + 2k + 1 replicas. The only difference in NETWORK-STABILITY is that at any time the set S of stable replicas can be populated by any 2f + k + 1 correct replicas, i.e. the minimum number required to order updates and elect a new leader. Hence, in a system with 3f + 2k + 1 replicas, if NETWORK-STABILITY holds, we can still guarantee BOUNDED-DELAY over the rejuvenation cycle, but for at most 3f + 2k occurrences the worst-case bound on BOUNDED-DELAY becomes t = (2f + k)·α. We calculate that t will be sub-second for real systems with large state.

Compared to the case of 3f + 1, in the presence of 3f + 2k + 1 replicas we have a higher number of occurrences in which the worst-case bound on BOUNDED-DELAY is t. However, the rejuvenation cycle is longer, and we also expect the time (2f + k)·α to settle on a correct and fast leader to be much smaller than the time β to complete recovery after the rejuvenation of a correct replica. In addition, under normal operations (i.e. no recovery in progress), in the presence of f faults we can tolerate the temporary partition of at most k replicas while still guaranteeing BOUNDED-DELAY. Unlike the 3f + 1 replicas case, where the protocol can ensure only LIVENESS, the system does not need to wait for the partitioned replicas to catch up to also guarantee BOUNDED-DELAY. The replica-count arithmetic of the two configurations is summarized in the sketch below.
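The sketch below (Python; the function name and the α, β placeholders are ours) collects this arithmetic: replica count, certificate size, the partition threshold γ, and the worst-case BOUNDED-DELAY bound for both configurations.

```python
def configuration(f, k=0, alpha=1.0, beta=0.0):
    """Sizing for a system tolerating f Byzantine and k benign faults.

    alpha: time to detect and replace a malicious or slow leader.
    beta:  time for a correct replica to complete recovery (relevant
           only with 3f + 1 replicas, where the system must wait for a
           rejuvenated leader to finish recovering).
    """
    n = 3 * f + 2 * k + 1         # number of replicas (k = 0 gives 3f + 1)
    quorum = 2 * f + k + 1        # messages per pre-order/order certificate
    gamma = 2 * f + k             # a replica that cannot reach gamma correct
                                  # replicas is considered partitioned
    if k == 0:
        t = 2 * f * alpha + beta  # worst case after leader rejuvenation
    else:
        t = (2 * f + k) * alpha   # no need to wait for recovery
    return {"n": n, "quorum": quorum, "gamma": gamma, "worst_case_delay": t}

# f = 1: 4 replicas without spares; 6 replicas with k = 1.
print(configuration(f=1))
print(configuration(f=1, k=1))
```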
IV. PROACTIVE RECOVERY ALGORITHM

The proactive recovery algorithm depends on a component trusted to periodically initiate proactive recovery in a round-robin manner. We describe in Section V how this component can be implemented. After each rejuvenation we diversify replicas as described before. The main operations the protocol executes are:

1) Replica rejuvenation: the server that hosts a replica is periodically rebooted;
2) Key replacement: the private/public keys of the rejuvenated replica are refreshed with the help of the TPM, invalidating all previous keys that could be compromised;
3) State validation: we assume the presence of a database to maintain replicated state. It needs to be validated before applying new updates;
4) State transfer: if the state is compromised, a clean copy of the state must be transferred from the other correct replicas;
5) Prime certificate validation: Prime messages are persistently stored on the disk to ensure SAFETY across rejuvenations. They are reloaded into main memory and validated before accepting or sending new Prime messages; and
6) Prime certificate transfer: if some certificate is compromised or missing, it must be retrieved from the other correct replicas.

Next, we describe each operation of the proactive recovery protocol for a system with 3f + 1 replicas. The modifications for a system with 3f + 2k + 1 replicas are described in Section IV-F.

A. Replica rejuvenation

Periodically, the trusted component that runs a proactive recovery scheduler selects one replica to rejuvenate (see the scheduling sketch after this subsection). Each replica is equipped with a physical read-only medium (e.g. CD-ROM) that stores a clean copy of the operating system, the Prime bitcode, the MultiCompiler, and the public keys of the TPMs of the other servers. Note that periodically a system administrator can replace the operating system with the latest version, including security patches. Hence, the rejuvenated replica restarts the application from a correct version of the operating system and recompiles the Prime bitcode with the MultiCompiler, which takes as input a random seed generated by the TPM. The replica is then ready to restart Prime and execute the next steps of the proactive recovery protocol. During recovery the replica does not execute pre-ordering and ordering operations.
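As a sketch of the trusted scheduling component (Section V describes how it is deployed), the loop below rejuvenates one replica per inter-rejuvenation period in round-robin order; power_cycle is a placeholder for activating the replica's netbooter or swapping in a shadow virtual machine.

```python
import itertools
import time

def power_cycle(replica_id):
    # Placeholder: in the physical deployment this activates the
    # netbooter attached to the replica's server; in the virtualized
    # deployment it swaps in a pre-instantiated shadow VM.
    print("rejuvenating replica", replica_id)

def proactive_recovery_scheduler(n, inter_rejuvenation_period_s):
    """Trusted component: rejuvenate one replica at a time, round robin.

    A full rejuvenation cycle (every replica rejuvenated once) lasts
    n * inter_rejuvenation_period_s seconds.
    """
    for replica_id in itertools.cycle(range(n)):
        power_cycle(replica_id)
        time.sleep(inter_rejuvenation_period_s)

# Example: 7 replicas, one rejuvenation per day across the system.
# proactive_recovery_scheduler(n=7, inter_rejuvenation_period_s=86400)
```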

B. Key replacement

Each Prime replica has two private/public key pairs: one pair is generated by the TPM at deployment time, the other one is generated after each rejuvenation and used to sign/verify Prime messages. The private key generated by the TPM is stored in the TPM itself and cannot be leaked or deleted unless the adversary has physical access to the system and clears the TPM's internal registers (we assume the adversary has no physical access to the system). The public key is disseminated to the other replicas, so they can verify what this TPM signs. Private/public keys to sign/verify Prime messages, also referred to as session keys, are refreshed after rejuvenation. Indeed, a replica can be impersonated if the attacker leaks the private session key and sends it to other malicious replicas.

Session key replacement is the first operation to execute. If the replica was malicious before rejuvenation, after invalidating its old keys we consider that replica to be experiencing a benign fault. Specifically, the rejuvenated replica generates a new private/public session key pair; then it uses the TPM to sign a message containing the new public key. The TPM also generates and attaches a monotonically increasing sequence number to that message before signing it, to avoid replay attacks. The message is then sent to all other replicas in the system. A correct replica accepts (and appends to a file) a new key if and only if the sequence number generated by the sending TPM is higher than the sequence numbers associated with the old public session keys of the same replica. From this moment until the next rejuvenation, the recovering replica will sign all messages with the new private session key. When at least f + 1 correct replicas (at least 2f + 1 replicas in total) accept the new key, all the previous keys of the rejuvenated replica are invalid. At this point, updates injected by an impersonator and signed with the old private key cannot be ordered by Prime. To ensure that a new key reaches all correct replicas, we use a forwarding mechanism: the first time a correct replica receives a new key, it propagates the message to all other replicas. Note that a correct replica accepts a message signed with the TPM if and only if it contains a new public session key. Other messages signed with the TPM that could be injected by malicious replicas are dropped. A sketch of this acceptance rule appears at the end of this subsection.

Since old public keys may be required to verify older pre-ordering and ordering messages during certificate validation (see Section IV-E), the file with session keys must be validated after rejuvenation. Because we expect this file to be small in size (e.g. 70 MB in a system with 10 replicas, a rejuvenation rate of one replica per day, and a system lifetime of 30 years), this operation is straightforward: the rejuvenated replica computes a digest of the file, and broadcasts a message to request the digest of the same file stored by other replicas. The rejuvenated replica waits for f + 1 replies with the same digest: if this value matches the one computed locally, then the file is correct. Otherwise, the replica requests the file from those f + 1 replicas one by one until it receives a file that matches the digest.
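The acceptance rule for key-replacement messages can be sketched as follows (Python; the names are ours, and verify_fn is a stand-in for real signature verification against a TPM public key):

```python
class SessionKeyStore:
    """Tracks the current public session key of each replica."""

    def __init__(self, tpm_public_keys, verify_fn):
        self.tpm_public_keys = tpm_public_keys  # loaded from read-only medium
        self.verify_fn = verify_fn              # signature check stand-in
        self.session_keys = {}                  # replica id -> (counter, key)

    def accept_new_key(self, sender, counter, new_public_key, signature):
        # Drop any TPM-signed message that does not verify under the
        # sender's TPM public key (e.g. forged by a malicious replica).
        msg = (sender, counter, new_public_key)
        if not self.verify_fn(self.tpm_public_keys[sender], msg, signature):
            return False
        # The TPM's monotonically increasing counter defeats replays:
        # accept only counters higher than any previously accepted one.
        last_counter = self.session_keys.get(sender, (-1, None))[0]
        if counter <= last_counter:
            return False
        # Accepting the key invalidates all previous session keys of the
        # sender; the (counter, key) pair is also appended to the on-disk
        # key file so old certificates can still be verified later.
        self.session_keys[sender] = (counter, new_public_key)
        return True
```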
C. State validation

We assume that each Prime replica applies updates to a database. The content of the database represents the replicated state, which must be validated after rejuvenation. Correct replicas take a checkpoint of the state every x updates. This operation can be executed in several ways: modern databases allow users to take periodic snapshots using copy-on-write techniques, or by dumping the database content to a text file. In addition, third-party tools also allow users to execute the checkpointing operation efficiently. In this work we do not specify the technique used to take a database snapshot; we only consider the output of a checkpointing operation, i.e. a blob of data.

Replicas maintain ck checkpoints on the disk, with ck a parameter of the algorithm. Each checkpoint is logged in a file, which contains, for each checkpoint, the sequence number of the last executed operation and a digest of the state. After rejuvenation and key replacement, the fresh replica reads the state at the most recent checkpoint, e.g. ck_i, and computes the digest. Then, a request for the digest of ck_i, together with the sequence number of the last operation executed before ck_i, is sent to all other replicas. A correct replica that receives such a request replies back with the digest of that checkpoint if it has that digest; otherwise it replies with a null value (the requested checkpoint may be too old, or may refer to an invalid or non-existing checkpoint if the rejuvenated replica was compromised). The rejuvenated replica waits for f + 1 replies with the same sequence number and digest, or f + 1 null replies. In the former case, the replica compares the received digests with the one computed locally: if they match, the state at checkpoint ck_i is correct; otherwise state transfer is necessary. In the case of f + 1 null replies, instead, the rejuvenated replica selects from the log file an older checkpoint, e.g. ck_{i-1}, and repeats the state validation process. If the rejuvenated replica has no valid checkpoints, it broadcasts a request for the most recent checkpoint taken by other replicas. The replica waits for 2f + 1 replies and selects the checkpoint ck with the (f + 1)th highest sequence number. The sketch after this subsection summarizes this validation loop.
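A minimal sketch of the validation loop follows (Python; query_digest is a placeholder for the reply-collection exchange, and the real protocol keeps waiting for more replies when neither f + 1 matching digests nor f + 1 null replies have yet arrived):

```python
import hashlib
from collections import Counter

def digest(blob):
    return hashlib.sha256(blob).hexdigest()

def validate_state(local_checkpoints, query_digest, f):
    """Validate the rejuvenated replica's checkpoints, newest first.

    local_checkpoints: list of (seq_num, blob), most recent last.
    query_digest(seq_num): replies {replica: digest or None} from the
    other replicas. Returns a valid (seq_num, blob), or None when state
    transfer is necessary.
    """
    for seq_num, blob in reversed(local_checkpoints):
        replies = query_digest(seq_num)
        counts = Counter(replies.values())
        if counts.get(None, 0) >= f + 1:
            continue                              # too old: try an older one
        for d, votes in counts.items():
            if d is not None and votes >= f + 1:  # f + 1 matching digests
                if d == digest(blob):
                    return seq_num, blob          # local checkpoint is correct
                return None                       # compromised: transfer state
    return None                                   # no valid local checkpoint
```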

D. State transfer

In a system with a potentially large state, the state transfer mechanism must be efficient to guarantee that the recovery of a compromised replica completes as quickly as possible. This allows the system to rejuvenate more often, supporting the assumption that the adversary does not have enough time to compromise more than one third of the entire system. In the following, we propose two different strategies, one that prioritizes fast recovery at the cost of bandwidth overhead, and one that minimizes bandwidth usage. A system administrator can choose the strategy that best satisfies the application requirements. We logically partition the state into several data blocks of fixed size. The rejuvenated replica recovers correct state by transferring these data blocks.

1) State transfer reducing latency: The recovering replica requests a data block b_i from f + 1 replicas, while another f replicas are selected to send just a digest of that block. The recovering replica then computes a digest for each received copy of the data block until a correct copy is found, that is, the one whose digest matches at least another f digests. After that, the recovering replica moves on to recover data block b_{i+1}. This approach reduces the time to complete state transfer, because each block is collected in a single round (we are guaranteed that at least one copy out of f + 1 is correct), at the cost of bandwidth overhead because each block is sent f + 1 times.

2) State transfer reducing bandwidth usage: This strategy reduces the impact of malicious replicas that try to hamper the state transfer process by using a blacklisting mechanism: when a replica is blacklisted, it is not contacted anymore during state transfer. The recovering replica requests a data block b_i from one replica, while another f replicas are selected to send just a digest of that block. The recovering replica then computes the digest of the received block and compares this value with the f received digests. If they match, the data block is correct. This represents the best case: the data block is retrieved in a single round, with no bandwidth overhead (i.e. just a single copy of that block is sent). If the digests do not match, at least one more round is required. We propose two different variations.

a) Variant 1: An additional replica, different from the previous f + 1, is selected to send another copy of the data block. This process repeats until the recovering replica finds a correct copy of the data block (at most f times). This minimizes the bandwidth usage at the cost of additional delay. The received responses (copies of the block and digests) are also used to identify and blacklist malicious replicas that sent invalid information (a sketch of this variant appears at the end of this subsection).

b) Variant 2: f additional replicas, different from the previous f + 1, are selected to send a copy of the data block. Then, the recovering replica has 2f + 1 responses (i.e. f + 1 blocks and f digests) and uses the same approach described in Section IV-D1 to find a valid block. In addition, these responses are also used to identify and blacklist malicious replicas that sent invalid information. This approach ensures that in at most two rounds the recovering replica finds a correct data block, optimizing latency compared with the previous variant, at the cost of a higher bandwidth overhead.

When a valid copy of the data block is found, the recovering replica moves on to recover block b_{i+1}. It is worthwhile to note that the blacklisting mechanism ensures that variants 1 or 2 are executed no more than f times during a state transfer instance. Because we expect f to be much smaller than the number of data blocks, the impact of malicious replicas on state transfer is negligible.

The strategies we propose aim to efficiently retrieve a large state in the expected scenario: when a replica under the control of an adversary has state that is entirely compromised. However, one can implement more conservative strategies, in which a data block is retrieved only if it is compromised. In this case, the recovering replica will collect the digest from f other replicas first, compare this value to the digest of the same block stored locally, and then transfer the block only if necessary. We extend these techniques to retrieve multiple blocks in parallel and balance the load across the other correct replicas.

Finally, note that modern database systems implement state transfer strategies (typically in a client-server or publish-subscribe manner) to maintain up-to-date consistent copies of the same database. However, these solutions are not intrusion tolerant and hence not suitable for our system.
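The sketch below illustrates Variant 1 for a single block (Python; fetch_block and fetch_digest are placeholders for the real requests, and the acceptance rule, a copy whose digest matches the responses of f other replicas so that f + 1 replicas vouch for it and at least one is correct, is our reading of the description above):

```python
import hashlib

def transfer_block(i, replicas, f, blacklist, fetch_block, fetch_digest):
    """Variant 1 of the bandwidth-minimizing transfer (Section IV-D2)."""
    live = [r for r in replicas if r not in blacklist]
    # f replicas send only a digest of block b_i ...
    responses = {r: fetch_digest(r, i) for r in live[1:f + 1]}
    blocks = {}
    # ... one replica sends the block itself; on a mismatch, one more
    # replica at a time is asked for a copy (at most f retries).
    for sender in [live[0]] + live[f + 1:2 * f + 1]:
        blocks[sender] = fetch_block(sender, i)
        responses[sender] = hashlib.sha256(blocks[sender]).hexdigest()
        for r, block in blocks.items():
            d = responses[r]
            votes = sum(1 for o, v in responses.items() if o != r and v == d)
            if votes >= f:
                # Every replica whose response disagrees with the winning
                # digest sent invalid information: blacklist it.
                blacklist |= {o for o, v in responses.items() if v != d}
                return block
    raise RuntimeError("no valid copy of the block found")
```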
E. Prime certificate validation

Pre-order and order certificates, combined with the state, represent the memory of the system. A pre-order certificate is composed of 2f + 1 messages: a client operation o and 2f acknowledgments with a digest of that operation. Each operation is uniquely identified by the pair (id, seq_num), where id is the identifier of the replica that injected the update into the system, and seq_num is a local sequence number generated at that replica. As in other BFT protocols, an order certificate in Prime is composed of: (i) a prepare certificate, with a pre-prepare message sent by the leader and 2f prepare messages; and (ii) a commit certificate, with 2f + 1 commit messages. A pre-prepare message in Prime contains a summary matrix that assigns an order to client operations, while prepare and commit messages contain a digest of that matrix. In addition, a pre-prepare message is uniquely identified by the pair (view, i), where view is the current view number and i is the i-th pre-prepare during that view.

During Prime operations, when a certificate is ready, a correct replica saves that certificate on the disk. These messages are reloaded into main memory after rejuvenation. In particular, pre-order certificates contain the updates that a rejuvenated replica has to apply to a correct copy of the state at some checkpoint in order to catch up to the same execution point as before rejuvenation, if that replica was correct. Order certificates, instead, specify the order in which those updates must be applied. Because the adversary can compromise pre-order and order certificates stored on the disk, they must be validated.

1) Validation of order certificates: Order certificates are validated first, because they determine the set of updates that the rejuvenated replica expects to find during the pre-order certificate validation, and the order in which correct updates must be applied to the state. Because the recovering replica may be missing some messages when it is rejuvenated, it is necessary to estimate the current execution point to catch up. We use the same approach as in []: the rejuvenated replica broadcasts a request for the sequence number of the last pre-prepare message for which each replica obtained a prepare certificate. The replica collects sequence numbers, including its own number, and selects the value (v, p) such that at least 2f + 1 different replicas reported a value greater than or equal to p for view v (see the sketch at the end of this subsection). This is necessary to ensure SAFETY if the rejuvenated replica was correct: p is guaranteed to be no smaller than the value p' proposed by the rejuvenated replica.

Verifying whether a prepare or commit certificate is valid requires the replica to verify the pre-prepare message, compute the digest of the summary matrix, and compare this value to the digests contained in the prepare or commit messages. If the validation fails for some message, or there is a gap in the pre-prepare sequence, the rejuvenated replica sends a request to another f + 1 replicas with the sequence number and view of the compromised/missing pre-prepare. Replies include the pre-prepare message, and prepare and commit certificates. In this way we are guaranteed that at least one reply contains valid certificates. The verification process is executed for all pre-prepare messages with sequence number up to (v, p). Note that the validation of an order certificate may involve just the pre-prepare and the 2f + 1 commit messages, because the collection of a correct commit certificate implies that the ordering proposed by the leader has been previously accepted. However, the validation of prepare certificates is necessary to ensure that all Prime messages saved in the persistent storage are correct after rejuvenation.
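The selection of (v, p) can be sketched as follows (Python; the report values in the example are illustrative):

```python
def catch_up_point(reports, f):
    """Select the execution point (v, p) up to which order certificates
    are validated.

    reports: {replica: (view, last_prepared_seq)} collected after the
    broadcast, including the rejuvenated replica's own value. The
    (2f + 1)-th highest reported pair is, by construction, a value that
    at least 2f + 1 replicas reported meeting or exceeding, and it is
    no smaller than the replica's own proposal.
    """
    ranked = sorted(reports.values(), reverse=True)
    return ranked[2 * f]          # requires len(reports) >= 2f + 1

# Hypothetical example with f = 1 and 4 replicas:
print(catch_up_point({0: (3, 120), 1: (3, 118), 2: (3, 118), 3: (2, 90)}, f=1))
# -> (3, 118): at least 2f + 1 = 3 replicas prepared up to seq 118 in view 3.
```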

2) Validation of pre-order certificates: The procedure to validate pre-order certificates is similar to the one described in the previous point. Verifying whether a pre-order certificate is valid requires the replica to compute the digest of the client operation and compare this value with the digests in ack messages. Note that in Prime the complete set of update identifiers is obtained from pre-prepare messages. If the validation fails for some message, or there is a missing update, the rejuvenated replica sends a request to another f + 1 replicas with the sequence number of the compromised/missing update. Replies include the client operation and 2f ack messages. In this way we are guaranteed that at least one reply contains a valid certificate. When pre-order validation completes, the correct set of updates is applied to the state. After that, the Prime replica is correct and is ready to resume Prime operations.

F. Extension to a system with 3f + 2k + 1 replicas

Augmenting the number of system replicas requires some minor changes to the Prime protocol. As introduced in Section III-C, the number of messages to order updates and, consequently, the certificate size change from 2f + 1 to 2f + k + 1. This is necessary because, otherwise, f malicious replicas can send different messages to two distinct sets of f + 1 replicas, violating SAFETY. The proactive recovery protocol requires some minor changes.

Key replacement: A new session key invalidates all previous keys of the same replica when at least another 2f + k replicas receive it. In this way we guarantee that Prime does not order updates signed with old keys.

State validation: During state validation the recovering replica still requires f + 1 replies with the same digest. Even in the case when the recovering replica has no valid checkpoint, it waits for 2f + 1 replies and selects the checkpoint ck with the (f + 1)th highest sequence number.

State transfer: The strategy described in Section IV-D1 still needs f + 1 copies of a data block and f digests. At most f replies, in fact, may contain invalid information, and the remaining f + 1 replies are enough to find a correct copy of the block. The strategy described in Section IV-D2, including the two variants, requires no additional change.

Certificate validation: The size of Prime certificates increases from 2f + 1 to 2f + k + 1, while the transfer of missing/compromised certificates still requires f + 1 replies.

V. EVALUATION

A. System deployment

In this section we describe the deployment of the survivable Prime system in a physical and a virtualized environment.

1) Deployment in a physical environment: The deployment in a physical environment is depicted in Figure 2. The proactive recovery logic runs in a separate computer and includes the scheduler that periodically rejuvenates one replica at a time, in a round-robin fashion.
The computer is connected to a network switch that, in turn, connects to netbooters (i.e. remotely activated power switches), one for each physical server that hosts a Prime replica. The network that connects the computer with the proactive recovery logic, the switch, and the netbooters is completely isolated from outside communication, and thus cannot be accessed by the attacker. Physical servers are connected to the netbooters through power wires. Periodically, the proactive recovery scheduler activates one of the netbooters to cycle the power and reboot the corresponding server. After rebooting, the replica executes the proactive recovery protocol as described in the previous section.

[Fig. 2: Deployment in a physical system, with the proactive recovery logic isolated from the external network. Network connections link the recovery computer, switch, and netbooters; power connections link each netbooter to its server.]

2) Deployment in a virtualized environment: The deployment in a virtualized environment is depicted in Figure 3, and is based on the Xen Cloud Platform (XCP). The proactive recovery logic runs in the privileged Dom0 domain on a separate physical server, while Prime replicas run in Xen guest domains. XCP allows creating a resource pool: a virtualized system that can be managed from a single hypervisor, i.e. the master, which can instruct other hypervisors to execute specific operations (e.g. creating/destroying virtual machines). The proactive recovery logic runs in the master hypervisor, which communicates with the other hypervisors through a private network.

The advantage of a deployment in a virtualized environment is that rejuvenation is much shorter than in a physical environment: a shadow virtual machine, in fact, can be instantiated in advance, with a clean copy of the operating system and a new version of Prime. When the shadow virtual machine is ready to start, the old one can be destroyed. We built a prototype that, as soon as the new virtual machine is ready to take over, shuts down the old virtual machine, detaches the virtual disk with the state and the virtual network interface from that machine, and attaches them to the new one. We measured the time taken to complete these operations to be approximately 8 seconds. A sketch of this swap appears below.
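The swap can be sketched with the XenAPI bindings that XCP exposes (a sketch only: the VBD record below is abbreviated, the reattachment of the virtual network interface is handled analogously and omitted, and error handling is left out):

```python
import XenAPI  # XCP / XenServer management API bindings

def swap_in_shadow_vm(session, old_vm, shadow_vm):
    """Replace a replica VM with its pre-instantiated shadow.

    Assumes the shadow VM was created in advance with a clean OS image
    and a freshly diversified Prime binary.
    """
    api = session.xenapi
    # XCP requires a clean shutdown before a VM can be destroyed; this
    # shutdown dominates the ~8 second transition time we measured.
    api.VM.clean_shutdown(old_vm)
    # Detach the virtual disk holding the state from the old VM and
    # reattach it to the shadow VM.
    for vbd in api.VM.get_VBDs(old_vm):
        vdi = api.VBD.get_VDI(vbd)
        api.VBD.create({"VM": shadow_vm, "VDI": vdi, "userdevice": "1",
                        "mode": "RW", "type": "Disk", "bootable": False,
                        "empty": False, "other_config": {},
                        "qos_algorithm_type": "", "qos_algorithm_params": {}})
    api.VM.start(shadow_vm, False, False)  # start_paused=False, force=False
    api.VM.destroy(old_vm)  # destroy the old, possibly compromised VM
```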

[Fig. 3: Deployment in a virtualized system based on XCP. Each physical server runs a hypervisor with a privileged Dom0 domain; the proactive recovery logic runs in Dom0 on a separate server, and each other server hosts a Prime replica in a guest domain.]

Note that XCP does not allow destroying a virtual machine before shutting it down; the shutdown operation is what most impacts the time to transition. Finally, the old virtual machine is destroyed, while the second one can initiate the proactive recovery protocol.

The virtualized environment poses two drawbacks: (i) the server that runs the proactive recovery logic can be compromised if the attacker compromises the hypervisor; (ii) a virtualized environment offers poor or no support for the TPM (i.e. solutions present in the literature are either not implemented or only partially implemented), which is a fundamental component in our survivable system. Regarding point (i), a deployment in a virtualized environment is suggested if one trusts the hypervisor. This can be substantiated, as an example, by rebooting physical servers from time to time and verifying the status of the servers through a chain of trust (i.e. digest of BIOS, operating system kernel, ...) computed by the TPM. Regarding point (ii), instead, we exploit the fact that Dom0 has full access to the underlying hardware to implement a TPM manager that mediates the interaction between the Prime replica in a virtual machine and the physical TPM. Each time the Prime replica has to access the TPM, it contacts the TPM manager in Dom0, which, in turn, accesses the TPM and replies back to the Prime replica with the outcome of the specific operation.

B. Experimental results

We deploy the system in a physical cluster with 3f + 1 servers. Each server is a Dell PowerEdge R210 II, with an Intel Xeon E3-1270 v2 3.5 GHz processor and 16GB of memory. All servers are connected by Solarflare 5161T 10GbE cards and an Arista 7120T-4S switch, providing 10 Gigabit Ethernet. All servers run CentOS 6.2 as the operating system. In the following we measure the time to complete state validation and transfer in systems with 4, 7, and 10 replicas. The proactive recovery protocol has been implemented using Prime 1.1 [5], which uses an intrusion tolerant communication substrate built on top of Spines 4.0 [6].

Table I shows the measurements for state reading and transfer for different state sizes, from 1 gigabyte to 1 terabyte. These state sizes are quite common in SCADA systems, military command and control, banking systems, and communication systems, to name a few. We evaluate the state transfer strategy described in Section IV-D2 that minimizes the bandwidth usage. The whole state is fragmented into data blocks of 1 megabyte. We transfer these blocks in parallel, 5 at a time. During the experiment, we noticed that the time to transfer the state grows linearly with the number of data blocks.

TABLE I: STATE VALIDATION AND TRANSFER MEASUREMENTS FOR DIFFERENT STATE SIZES AND NUMBER OF SYSTEM REPLICAS.

state size | state reading | transfer (4 replicas) | transfer (7 replicas) | transfer (10 replicas)
1 Gb       | 9 sec         | 36 sec                | 25 sec                | 23 sec
10 Gb      | 1 m, 27 sec   | 6 m                   | 4 m                   | 4 m
40 Gb      | 5 m, 47 sec   | 24 m                  | 15 m                  | 15 m
80 Gb      | 11 m, 30 sec  | 48 m                  | 31 m                  | 31 m
120 Gb     | 17 m, 15 sec  | 1 h, 12 m             | 48 m                  | 48 m
240 Gb     | 34 m, 30 sec  | 2 h, 24 m             | 1 h, 38 m             | 1 h, 36 m
520 Gb     | 1 h, 14 m     | 5 h, 19 m             | 3 h, 28 m             | 3 h, 24 m
1 Tb       | 2 h, 24 m     | 9 h, 50 m             | 6 h, 17 m             | 6 h, 17 m
The performance hit observed in the system with 4 replicas is due to the fact that some replicas are sending more data blocks at the same time, which produces a higher CPU consumption that limits the processing speed of the physical servers that host those replicas. Based on the obtained results, in Section VI we show how to set the rejuvenation rate in our theoretical model. The model outputs the minimum security requirement that a replica has to guarantee in order to achieve a desired confidence in the system over its lifetime.

VI. THEORETICAL REJUVENATION MODEL

In this section we present a theoretical model that allows a system administrator to determine the deployment parameters needed to reach the desired confidence in the system. As a building block, we present an equation that computes the resiliency of the system based on the rejuvenation rate and the number of replicas. The equation takes as input: (i) the probability c that a replica is correct over a year; (ii) the rejuvenation rate r, i.e. the number of rejuvenations per day across the whole system, which is limited by the system's state size; (iii) the number of replicas n; and (iv) the system lifetime y. A system administrator has control over r and n, but he does not have control over c, which represents the strength of a replica. The strength of a replica can be estimated with CERT alerts, bug reports, or other historical information, as in [7]. The output of the computation is the probability that the system will remain correct over its lifetime. By iterating this computation many times with different possible values of r and n, the system administrator can determine the replica strength required to reach the desired confidence in the system. Alternatively, if the replica strength is fixed, the same computation can be iterated to find the required values of r and n to reach the desired confidence.

A. Model description

In our model we consider only failures related to software aspects, not hardware failures. We assume that every replica has the same probability of being compromised over a year.

In addition, we assume that each replica fails independently. We support this assumption by making extensive use of diversity, as explained before. Replicas are periodically rejuvenated one at a time in a round-robin fashion. For simplicity's sake we optimistically assume that recovery is instantaneous. However, in Section VI-C we extend this model to relax this assumption, at the cost of an additional 2k replicas in the computation, and pessimistically assume that a recovery operation spans an entire rejuvenation cycle.

In order to compute the probability that the system is correct over its lifetime, we first define p as the probability that a replica is correct at the end of an inter-rejuvenation period, by simply scaling the probability c to the appropriate time interval: p = c^{1/(365 r)}. The following equation then gives the probability that the system will survive over a lifetime of y years:

\left( \sum_{i=n-f}^{n} \mathrm{coeff}\!\left[ \prod_{j=1}^{n} \left( (1 - p_j) + p_j\, x \right) \right][i] \right)^{365\, r\, y}

where coeff[q][i] denotes the coefficient of x^i in the polynomial q. The innermost part of the equation is a product of the form \prod_{j=1}^{n} ((1 - p_j) + p_j x). Each term of this product corresponds to one replica, where p_j is the probability that the replica will remain correct to the end of the current rejuvenation round and 1 - p_j is the probability that the replica will be compromised by the end of the current rejuvenation round. This is because that replica has not been rejuvenated for the last j rounds and must have survived all of them in addition to the current round in order to remain correct at the end of the current round. By taking the product of these terms, the i-th coefficient in the resulting polynomial is formed by the sum of all possible combinations of the p_j x components of i of the terms with the (1 - p_j) components of the remaining n - i terms. Thus, the i-th coefficient represents the probability that exactly i replicas will remain correct at the end of the current rejuvenation round.

We sum these probabilities for all the cases in which the system would survive the round, i.e. the coefficients of the polynomial from the (n - f)th coefficient to the nth coefficient. This excludes any case in which more than f replicas have failed. This sum then gives the probability that the system will survive one rejuvenation round. In fact, there are many rejuvenation rounds in the system's lifetime, which can be treated as a repeated experiment. In order for the system to succeed for its entire lifetime, it must succeed in every rejuvenation round of its lifetime. The probability of success, then, is the product of the probabilities of success for all the rejuvenation rounds, here represented as the exponent 365 r y, which captures the y year lifetime, 365 days per year, and r rejuvenations per day.
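The equation can be evaluated directly by building the polynomial one replica at a time. In the sketch below (Python; the function name is ours) we take p_j = p^j, i.e. the j-th replica in the round-robin order was last rejuvenated j rounds ago, which is our reading of the description above.

```python
def survival_probability(c, r, n, f, y):
    """Probability the system stays correct for y years.

    c: probability a replica is correct over a year;
    r: rejuvenations per day across the whole system;
    n: number of replicas; f: tolerated intrusions.
    """
    # Scale the yearly strength c to one inter-rejuvenation period.
    p = c ** (1.0 / (365 * r))
    # Build prod_{j=1..n} ((1 - p_j) + p_j * x) by convolution; after
    # processing all replicas, poly[i] is the probability that exactly
    # i replicas are correct at the end of the round.
    poly = [1.0]
    for j in range(1, n + 1):
        pj = p ** j                    # replica rejuvenated j rounds ago
        new = [0.0] * (len(poly) + 1)
        for i, a in enumerate(poly):
            new[i] += a * (1 - pj)     # this replica compromised
            new[i + 1] += a * pj       # this replica still correct
        poly = new
    round_ok = sum(poly[n - f:])       # at least n - f correct replicas
    # Every round of the lifetime must succeed: 365 * r * y rounds.
    return round_ok ** (365 * r * y)

# Example: 7 replicas (f = 2), one rejuvenation per day, 30 years.
print(survival_probability(c=0.8, r=1, n=7, f=2, y=30))
```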
B. Application to realistic scenarios

Let us apply the model to a concrete example based on the measured time to complete state transfer reported in Section V, Table I. Figure 4(a) shows the required strength of a replica to achieve a confidence in the system of 95% over 30 years when varying the number r of rejuvenations per day and the state size.

[Fig. 4: Required strength of a replica to achieve a confidence in the system of 95% over 30 years, varying: (a) the rejuvenation rate and the state size, with 7 replicas in the system; (b) the rejuvenation rate, for different numbers of replicas in the system (f = 0 to f = 4).]

The time to transfer the state determines the maximum rejuvenation rate possible for each state size. In the presence of large state, some of the configurations are not possible. As an example, we cannot have r > 2 when the state size is 1 terabyte, because the total time to validate and transfer that state is 8 hours and 41 minutes. Figure 4(a) shows that increasing r, when possible, helps the system to work correctly in the presence of weak replicas.

In Figure 4(b) we vary the rejuvenation rate r for different numbers of tolerated failures f. We assume the state size is small enough to allow multiple rejuvenations per day (e.g. 1, 10, or 40 gigabytes). Augmenting the number of replicas allows the system to survive in the presence of weaker replicas, but the benefit of adding replicas decreases as f increases.

The results presented in this section show that proactive recovery is possible in the presence of large state. Some papers in the literature [], [5] use an aggressive rejuvenation rate (one rejuvenation every 5, 10, or 15 minutes), but these rates limit the size of the state to transfer and incur unnecessary overhead. Our model shows that a rejuvenation rate of one replica per day requires a replica strength of 0.54 when the state size is 1 terabyte. In addition, our model helps to determine the number of replicas to deploy. Many BFT protocols in the literature are evaluated with 4 replicas. This search for the required strength can be automated, as sketched below.
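The iteration described at the start of this section reduces to a binary search on c, which works because survival is monotone in the replica strength. This sketch reuses survival_probability() from the previous sketch.

```python
def required_strength(target, r, n, f, y, tol=1e-6):
    """Smallest yearly per-replica strength c such that the system
    survives y years with probability >= target.

    Assumes survival_probability() from the previous sketch is in scope.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if survival_probability(mid, r, n, f, y) >= target:
            hi = mid                 # mid is strong enough: search lower
        else:
            lo = mid                 # too weak: search higher
    return hi

# Example: 95% confidence over 30 years with 7 replicas (f = 2) and
# one rejuvenation per day across the system.
print(required_strength(0.95, r=1, n=7, f=2, y=30))
```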

The model shows that a system with 4 replicas requires stronger replicas, even in the presence of multiple rejuvenations per day. At the same time, deploying systems with more than 10 replicas does not increase the resiliency of the system remarkably, and it generates additional overhead in the BFT protocol.

C. Extension to a system with 3f + 2k + 1 replicas

So far, the model has assumed that rejuvenations occur instantaneously. Since the system has state, which takes time to verify or transfer, rejuvenations actually take some time to complete. In this section, we extend the model by conservatively assuming that each rejuvenation takes an entire rejuvenation round to complete. A system with 3f + 2k + 1 replicas can tolerate f Byzantine faults and k benign faults simultaneously, as previously shown in [5]. The changes to the protocol necessary to support the additional 2k replicas are specified in Section IV-F. Since each rejuvenating replica behaves as a replica experiencing a benign fault until it finishes its rejuvenation, we choose k to be 1, reflecting the fact that we rejuvenate one replica at a time. Then, the system is able to tolerate f Byzantine failures while rejuvenating some other correct replica, without sacrificing availability. To extend our model from 3f + 1 to 3f + 2k + 1 replicas, we change the value of n accordingly. The equation in the model is unchanged.

We now show the results of the same analysis as in the previous section, but with this extended model. The salient difference here is that we seek a 95% confidence that the system will remain correct and available for 30 years, whereas in the previous section we sought a 95% confidence that the system will remain correct, but not always available, for 30 years. The results of this new analysis are shown in Figures 5(a) and 5(b). Note that the curves with k = 0 in Figure 5(b) represent the same system configuration as reported in Figure 4(b).

[Fig. 5: Required strength of a replica to achieve a confidence in the system of 95% over 30 years in a system with 3f + 2k + 1 replicas, varying: (a) the rejuvenation rate and the state size; (b) the rejuvenation rate, for different values of f and k. The required strength of a replica increases compared to the case with 3f + 1 replicas (cf. Figures 4(a) and 4(b)).]

Compared to Figures 4(a) and 4(b), in Figures 5(a) and 5(b) we note an increased required strength of a replica to achieve the desired confidence in the system over its lifetime. This increase is driven by the 2k additional replicas in the system, required to maintain both correctness and availability, while the number of tolerated faults remains the same as in the case of 3f + 1 replicas. However, Figure 5(b) shows that this difference decreases as we increase the number of tolerated faults f.

VII. RELATED WORK

A. Proactive rejuvenation

Software rejuvenation [8] was introduced to increase the reliability and availability of continuously running applications, in order to prevent failures over time due to software aging. Both [8] and [9] formulate the rejuvenation model through a Markov chain and derive the optimal rejuvenation schedule. More recently, software rejuvenation has been used to proactively recover replicas in a malicious environment.
Castro in [] was the first to present a proactive recovery protocol for BFT systems, addressing issues such as the need for unforgeable cryptographic material, rebooting from read-only memory, and efficient state transfer. In our paper we extend the work in [] by describing how to obtain fine-grained diversity and efficient state transfer in the presence of a large state. Rodrigues et al. in [7] present BASE, which extends the BFT protocol described in [20] with data abstraction techniques in order to mask software errors. The abstraction layer hides implementation details and allows the reuse of off-the-shelf software components. Replicas are periodically rejuvenated, rebooting from a clean state.

The authors of [2] propose splitting the system into a synchronous component that activates periodic rejuvenation and an any-synchronous subsystem that includes the payload application. This model has been adopted later in other papers [5], [6], [22], [23]. In particular, the authors of [5] enhance proactive rejuvenation with reactive recovery, which allows the recovery of replicas as soon as they are detected or suspected of being compromised. The proposed solution guarantees the availability of the minimum number of replicas required to make progress by using 2k additional replicas, where k is the maximum number of replicas that recover at the same time. In our paper we also describe how to adapt our protocol to work with 3f + 2k + 1 replicas, and separate the proactive recovery


Proactive Recovery in a Byzantine-Fault-Tolerant System

Proactive Recovery in a Byzantine-Fault-Tolerant System Proactive Recovery in a Byzantine-Fault-Tolerant System Miguel Castro and Barbara Liskov Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA 02139

More information

Proactive Recovery in a Byzantine-Fault-Tolerant System

Proactive Recovery in a Byzantine-Fault-Tolerant System Proactive Recovery in a Byzantine-Fault-Tolerant System Miguel Castro and Barbara Liskov Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA 02139

More information

A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm

A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm Appears as Technical Memo MIT/LCS/TM-590, MIT Laboratory for Computer Science, June 1999 A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm Miguel Castro and Barbara Liskov

More information

Byzantine Fault Tolerance

Byzantine Fault Tolerance Byzantine Fault Tolerance CS 240: Computing Systems and Concurrency Lecture 11 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. So far: Fail-stop failures

More information

Proactive Obfuscation

Proactive Obfuscation Proactive Obfuscation Tom Roeder Fred B. Schneider {tmroeder,fbs}@cs.cornell.edu Department of Computer Science Cornell University Abstract Proactive obfuscation is a new method for creating server replicas

More information

Customizable Fault Tolerance for Wide-Area Replication

Customizable Fault Tolerance for Wide-Area Replication Customizable Fault Tolerance for Wide-Area Replication Yair Amir 1, Brian Coan 2, Jonathan Kirsch 1, John Lane 1 1 Johns Hopkins University, Baltimore, MD. {yairamir, jak, johnlane}@cs.jhu.edu 2 Telcordia

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Practical Byzantine Fault Tolerance Consensus and A Simple Distributed Ledger Application Hao Xu Muyun Chen Xin Li

Practical Byzantine Fault Tolerance Consensus and A Simple Distributed Ledger Application Hao Xu Muyun Chen Xin Li Practical Byzantine Fault Tolerance Consensus and A Simple Distributed Ledger Application Hao Xu Muyun Chen Xin Li Abstract Along with cryptocurrencies become a great success known to the world, how to

More information

Survey of Cyber Moving Targets. Presented By Sharani Sankaran

Survey of Cyber Moving Targets. Presented By Sharani Sankaran Survey of Cyber Moving Targets Presented By Sharani Sankaran Moving Target Defense A cyber moving target technique refers to any technique that attempts to defend a system and increase the complexity of

More information

Scaling Byzantine Fault-tolerant Replication to Wide Area Networks

Scaling Byzantine Fault-tolerant Replication to Wide Area Networks Scaling Byzantine Fault-tolerant Replication to Wide Area Networks Cristina Nita-Rotaru Dependable and Secure Distributed Systems Lab Department of Computer Science and CERIAS Purdue University Department

More information

Research Statement. Amy Babay December 2018

Research Statement. Amy Babay December 2018 Research Statement Amy Babay December 2018 My research focuses on distributed systems and networks, with core goals spanning two domains: enabling new Internet services and building dependable infrastructure.

More information

Resource-efficient Byzantine Fault Tolerance. Tobias Distler, Christian Cachin, and Rüdiger Kapitza

Resource-efficient Byzantine Fault Tolerance. Tobias Distler, Christian Cachin, and Rüdiger Kapitza 1 Resource-efficient Byzantine Fault Tolerance Tobias Distler, Christian Cachin, and Rüdiger Kapitza Abstract One of the main reasons why Byzantine fault-tolerant (BFT) systems are currently not widely

More information

Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography

Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography Appears as Technical Memo MIT/LCS/TM-589, MIT Laboratory for Computer Science, June 999 Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography Miguel Castro and Barbara Liskov Laboratory

More information

Byzantine Fault Tolerance

Byzantine Fault Tolerance Byzantine Fault Tolerance CS6450: Distributed Systems Lecture 10 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University.

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

ZZ: Cheap Practical BFT using Virtualization

ZZ: Cheap Practical BFT using Virtualization University of Massachusetts, Technical Report TR14-08 1 ZZ: Cheap Practical BFT using Virtualization Timothy Wood, Rahul Singh, Arun Venkataramani, and Prashant Shenoy Department of Computer Science, University

More information

SentinelOne Technical Brief

SentinelOne Technical Brief SentinelOne Technical Brief SentinelOne unifies prevention, detection and response in a fundamentally new approach to endpoint protection, driven by behavior-based threat detection and intelligent automation.

More information

Practical Byzantine Fault Tolerance. Castro and Liskov SOSP 99

Practical Byzantine Fault Tolerance. Castro and Liskov SOSP 99 Practical Byzantine Fault Tolerance Castro and Liskov SOSP 99 Why this paper? Kind of incredible that it s even possible Let alone a practical NFS implementation with it So far we ve only considered fail-stop

More information

Cyber Moving Targets. Yashar Dehkan Asl

Cyber Moving Targets. Yashar Dehkan Asl Cyber Moving Targets Yashar Dehkan Asl Introduction An overview of different cyber moving target techniques, their threat models, and their technical details. Cyber moving target technique: Defend a system

More information

Mencius: Another Paxos Variant???

Mencius: Another Paxos Variant??? Mencius: Another Paxos Variant??? Authors: Yanhua Mao, Flavio P. Junqueira, Keith Marzullo Presented by Isaiah Mayerchak & Yijia Liu State Machine Replication in WANs WAN = Wide Area Network Goals: Web

More information

Today: Fault Tolerance. Replica Management

Today: Fault Tolerance. Replica Management Today: Fault Tolerance Failure models Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery

More information

Practical Byzantine Fault Tolerance and Proactive Recovery

Practical Byzantine Fault Tolerance and Proactive Recovery Practical Byzantine Fault Tolerance and Proactive Recovery MIGUEL CASTRO Microsoft Research and BARBARA LISKOV MIT Laboratory for Computer Science Our growing reliance on online services accessible on

More information

Byzantine Replication Under Attack

Byzantine Replication Under Attack Byzantine Replication Under Attack Yair Amir 1, Brian Coan 2, Jonathan Kirsch 1, John Lane 1 1 Johns Hopkins University, Baltimore, MD. {yairamir, jak, johnlane}@cs.jhu.edu 2 Telcordia Technologies, iscataway,

More information

Practical Byzantine Fault Tolerance (The Byzantine Generals Problem)

Practical Byzantine Fault Tolerance (The Byzantine Generals Problem) Practical Byzantine Fault Tolerance (The Byzantine Generals Problem) Introduction Malicious attacks and software errors that can cause arbitrary behaviors of faulty nodes are increasingly common Previous

More information

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics System Models Nicola Dragoni Embedded Systems Engineering DTU Informatics 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models Architectural vs Fundamental Models Systems that are intended

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Automatic Reconfiguration for Large-Scale Reliable Storage Systems

Automatic Reconfiguration for Large-Scale Reliable Storage Systems Automatic Reconfiguration for Large-Scale Reliable Storage Systems The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Cluster-Based Scalable Network Services

Cluster-Based Scalable Network Services Cluster-Based Scalable Network Services Suhas Uppalapati INFT 803 Oct 05 1999 (Source : Fox, Gribble, Chawathe, and Brewer, SOSP, 1997) Requirements for SNS Incremental scalability and overflow growth

More information

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Robert Grimm New York University (Partially based on notes by Eric Brewer and David Mazières) The Three Questions What is the problem? What is new or different? What

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

or? Paxos: Fun Facts Quorum Quorum: Primary Copy vs. Majority Quorum: Primary Copy vs. Majority

or? Paxos: Fun Facts Quorum Quorum: Primary Copy vs. Majority Quorum: Primary Copy vs. Majority Paxos: Fun Facts Quorum Why is the algorithm called Paxos? Leslie Lamport described the algorithm as the solution to a problem of the parliament on a fictitious Greek island called Paxos Many readers were

More information

Proactive Obfuscation

Proactive Obfuscation 4 Proactive Obfuscation TOM ROEDER Microsoft Research and FRED B. SCHNEIDER Cornell University Proactive obfuscation is a new method for creating server replicas that are likely to have fewer shared vulnerabilities.

More information

Spire: Intrusion-Tolerant SCADA for the Power Grid

Spire: Intrusion-Tolerant SCADA for the Power Grid Distributed Systems and Networks Lab Spire: Intrusion-Tolerant for the Power Grid Amy Babay*, Thomas Tantillo*, Trevor Aron, Yair Amir June 25, 2017 Distributed Systems and Networks Lab Department of Computer

More information

Recovering from a Crash. Three-Phase Commit

Recovering from a Crash. Three-Phase Commit Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator

More information

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Venugopalan Ramasubramanian Emin Gün Sirer Presented By: Kamalakar Kambhatla * Slides adapted from the paper -

More information

Assessing performance in HP LeftHand SANs

Assessing performance in HP LeftHand SANs Assessing performance in HP LeftHand SANs HP LeftHand Starter, Virtualization, and Multi-Site SANs deliver reliable, scalable, and predictable performance White paper Introduction... 2 The advantages of

More information

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter

More information

ROTE: Rollback Protection for Trusted Execution

ROTE: Rollback Protection for Trusted Execution ROTE: Rollback Protection for Trusted Execution Sinisa Matetic, Mansoor Ahmed, Kari Kostiainen, Aritra Dhar, David Sommer, Arthur Gervais, Ari Juels, Srdjan Capkun Siniša Matetić ETH Zurich Institute of

More information

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance Appears in the Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999 Practical Byzantine Fault Tolerance Miguel Castro and Barbara Liskov Laboratory

More information

Remote Data Checking for Network Codingbased. Distributed Storage Systems

Remote Data Checking for Network Codingbased. Distributed Storage Systems CCSW 0 Remote Data Checking for Network Coding-based Bo Chen, Reza Curtmola, Giuseppe Ateniese, Randal Burns New Jersey Institute of Technology Johns Hopkins University Motivation Cloud storage can release

More information

Modification and Evaluation of Linux I/O Schedulers

Modification and Evaluation of Linux I/O Schedulers Modification and Evaluation of Linux I/O Schedulers 1 Asad Naweed, Joe Di Natale, and Sarah J Andrabi University of North Carolina at Chapel Hill Abstract In this paper we present three different Linux

More information

Replicated State Machine in Wide-area Networks

Replicated State Machine in Wide-area Networks Replicated State Machine in Wide-area Networks Yanhua Mao CSE223A WI09 1 Building replicated state machine with consensus General approach to replicate stateful deterministic services Provide strong consistency

More information

Zyzzyva. Speculative Byzantine Fault Tolerance. Ramakrishna Kotla. L. Alvisi, M. Dahlin, A. Clement, E. Wong University of Texas at Austin

Zyzzyva. Speculative Byzantine Fault Tolerance. Ramakrishna Kotla. L. Alvisi, M. Dahlin, A. Clement, E. Wong University of Texas at Austin Zyzzyva Speculative Byzantine Fault Tolerance Ramakrishna Kotla L. Alvisi, M. Dahlin, A. Clement, E. Wong University of Texas at Austin The Goal Transform high-performance service into high-performance

More information

A Reliable Broadcast System

A Reliable Broadcast System A Reliable Broadcast System Yuchen Dai, Xiayi Huang, Diansan Zhou Department of Computer Sciences and Engineering Santa Clara University December 10 2013 Table of Contents 2 Introduction......3 2.1 Objective...3

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

Distributed Systems (ICE 601) Fault Tolerance

Distributed Systems (ICE 601) Fault Tolerance Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability

More information

Meltdown and Spectre - understanding and mitigating the threats (Part Deux)

Meltdown and Spectre - understanding and mitigating the threats (Part Deux) Meltdown and Spectre - understanding and mitigating the threats (Part Deux) Gratuitous vulnerability logos Jake Williams @MalwareJake SANS / Rendition Infosec sans.org / rsec.us @SANSInstitute / @RenditionSec

More information

Spire: Intrusion-Tolerant SCADA for the Power Grid

Spire: Intrusion-Tolerant SCADA for the Power Grid Distributed Systems and Networks Lab Spire: Intrusion-Tolerant for the Power Grid Amy Babay*, Thomas Tantillo*, Trevor Aron, Yair Amir June 25, 2017 Distributed Systems and Networks Lab Department of Computer

More information

Failure models. Byzantine Fault Tolerance. What can go wrong? Paxos is fail-stop tolerant. BFT model. BFT replication 5/25/18

Failure models. Byzantine Fault Tolerance. What can go wrong? Paxos is fail-stop tolerant. BFT model. BFT replication 5/25/18 Failure models Byzantine Fault Tolerance Fail-stop: nodes either execute the protocol correctly or just stop Byzantine failures: nodes can behave in any arbitrary way Send illegal messages, try to trick

More information

Today: Fault Tolerance

Today: Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

416 Distributed Systems. Distributed File Systems 4 Jan 23, 2017

416 Distributed Systems. Distributed File Systems 4 Jan 23, 2017 416 Distributed Systems Distributed File Systems 4 Jan 23, 2017 1 Today's Lecture Wrap up NFS/AFS This lecture: other types of DFS Coda disconnected operation 2 Key Lessons Distributed filesystems almost

More information

Semi-Passive Replication in the Presence of Byzantine Faults

Semi-Passive Replication in the Presence of Byzantine Faults Semi-Passive Replication in the Presence of Byzantine Faults HariGovind V. Ramasamy Adnan Agbaria William H. Sanders University of Illinois at Urbana-Champaign 1308 W. Main Street, Urbana IL 61801, USA

More information

Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast

Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast HariGovind V. Ramasamy Christian Cachin August 19, 2005 Abstract Atomic broadcast is a communication primitive that allows a group of

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

! Design constraints.  Component failures are the norm.  Files are huge by traditional standards. ! POSIX-like Cloud background Google File System! Warehouse scale systems " 10K-100K nodes " 50MW (1 MW = 1,000 houses) " Power efficient! Located near cheap power! Passive cooling! Power Usage Effectiveness = Total

More information

AS distributed systems develop and grow in size,

AS distributed systems develop and grow in size, 1 hbft: Speculative Byzantine Fault Tolerance With Minimum Cost Sisi Duan, Sean Peisert, Senior Member, IEEE, and Karl N. Levitt Abstract We present hbft, a hybrid, Byzantine fault-tolerant, ted state

More information

THE phenomenon that the state of running software

THE phenomenon that the state of running software TRANSACTION ON DEPENDABLE AND SECURE COMPUTING 1 Fast Software Rejuvenation of Virtual Machine Monitors Kenichi Kourai, Member, IEEE Computer Society, and Shigeru Chiba Abstract As server consolidation

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

Important Lessons. Today's Lecture. Two Views of Distributed Systems

Important Lessons. Today's Lecture. Two Views of Distributed Systems Important Lessons Replication good for performance/ reliability Key challenge keeping replicas up-to-date Wide range of consistency models Will see more next lecture Range of correctness properties L-10

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

HT-Paxos: High Throughput State-Machine Replication Protocol for Large Clustered Data Centers

HT-Paxos: High Throughput State-Machine Replication Protocol for Large Clustered Data Centers 1 HT-Paxos: High Throughput State-Machine Replication Protocol for Large Clustered Data Centers Vinit Kumar 1 and Ajay Agarwal 2 1 Associate Professor with the Krishna Engineering College, Ghaziabad, India.

More information

A FAULT- AND INTRUSION-TOLERANT ARCHITECTURE FOR THE PORTUGUESE POWER DISTRIBUTION SCADA

A FAULT- AND INTRUSION-TOLERANT ARCHITECTURE FOR THE PORTUGUESE POWER DISTRIBUTION SCADA A FAULT- AND INTRUSION-TOLERANT ARCHITECTURE FOR THE PORTUGUESE POWER DISTRIBUTION SCADA Nuno Medeiros Alysson Bessani 1 Context: EDP Distribuição EDP Distribuição is the utility responsible for the distribution

More information

GFS-python: A Simplified GFS Implementation in Python

GFS-python: A Simplified GFS Implementation in Python GFS-python: A Simplified GFS Implementation in Python Andy Strohman ABSTRACT GFS-python is distributed network filesystem written entirely in python. There are no dependencies other than Python s standard

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

BRANCH:IT FINAL YEAR SEVENTH SEM SUBJECT: MOBILE COMPUTING UNIT-IV: MOBILE DATA MANAGEMENT

BRANCH:IT FINAL YEAR SEVENTH SEM SUBJECT: MOBILE COMPUTING UNIT-IV: MOBILE DATA MANAGEMENT - 1 Mobile Data Management: Mobile Transactions - Reporting and Co Transactions Kangaroo Transaction Model - Clustering Model Isolation only transaction 2 Tier Transaction Model Semantic based nomadic

More information

Byzantine Fault Tolerant Raft

Byzantine Fault Tolerant Raft Abstract Byzantine Fault Tolerant Raft Dennis Wang, Nina Tai, Yicheng An {dwang22, ninatai, yicheng} @stanford.edu https://github.com/g60726/zatt For this project, we modified the original Raft design

More information

VIAF: Verification-based Integrity Assurance Framework for MapReduce. YongzhiWang, JinpengWei

VIAF: Verification-based Integrity Assurance Framework for MapReduce. YongzhiWang, JinpengWei VIAF: Verification-based Integrity Assurance Framework for MapReduce YongzhiWang, JinpengWei MapReduce in Brief Satisfying the demand for large scale data processing It is a parallel programming model

More information

Increasing Network Resiliency by Optimally Assigning Diverse Variants to Routing Nodes

Increasing Network Resiliency by Optimally Assigning Diverse Variants to Routing Nodes Increasing Network Resiliency by Optimally Assigning Diverse Variants to Routing Nodes Andrew Newell 1, Daniel Obenshain 2, Thomas Tantillo 2, Cristina Nita-Rotaru 1, and Yair Amir 2 1 Department of Computer

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

ERROR RECOVERY IN MULTICOMPUTERS USING GLOBAL CHECKPOINTS

ERROR RECOVERY IN MULTICOMPUTERS USING GLOBAL CHECKPOINTS Proceedings of the 13th International Conference on Parallel Processing, Bellaire, Michigan, pp. 32-41, August 1984. ERROR RECOVERY I MULTICOMPUTERS USIG GLOBAL CHECKPOITS Yuval Tamir and Carlo H. Séquin

More information

Cryptographic Primitives and Protocols for MANETs. Jonathan Katz University of Maryland

Cryptographic Primitives and Protocols for MANETs. Jonathan Katz University of Maryland Cryptographic Primitives and Protocols for MANETs Jonathan Katz University of Maryland Fundamental problem(s) How to achieve secure message authentication / transmission in MANETs, when: Severe resource

More information

CS268: Beyond TCP Congestion Control

CS268: Beyond TCP Congestion Control TCP Problems CS68: Beyond TCP Congestion Control Ion Stoica February 9, 004 When TCP congestion control was originally designed in 1988: - Key applications: FTP, E-mail - Maximum link bandwidth: 10Mb/s

More information

Towards Recoverable Hybrid Byzantine Consensus

Towards Recoverable Hybrid Byzantine Consensus Towards Recoverable Hybrid Byzantine Consensus Hans P. Reiser 1, Rüdiger Kapitza 2 1 University of Lisboa, Portugal 2 University of Erlangen-Nürnberg, Germany September 22, 2009 Overview 1 Background Why?

More information

ZZ and the Art of Practical BFT

ZZ and the Art of Practical BFT University of Massachusetts, Technical Report 29-24 ZZ and the Art of Practical BFT Timothy Wood, Rahul Singh, Arun Venkataramani, Prashant Shenoy, and Emmanuel Cecchet Department of Computer Science,

More information

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems Fall 2017 Exam 3 Review Paul Krzyzanowski Rutgers University Fall 2017 December 11, 2017 CS 417 2017 Paul Krzyzanowski 1 Question 1 The core task of the user s map function within a

More information

A Traceback Attack on Freenet

A Traceback Attack on Freenet A Traceback Attack on Freenet Guanyu Tian, Zhenhai Duan Florida State University {tian, duan}@cs.fsu.edu Todd Baumeister, Yingfei Dong University of Hawaii {baumeist, yingfei}@hawaii.edu Abstract Freenet

More information

Practical Byzantine Fault

Practical Byzantine Fault Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault

More information

Google File System. Arun Sundaram Operating Systems

Google File System. Arun Sundaram Operating Systems Arun Sundaram Operating Systems 1 Assumptions GFS built with commodity hardware GFS stores a modest number of large files A few million files, each typically 100MB or larger (Multi-GB files are common)

More information

STAR: Scaling Transactions through Asymmetric Replication

STAR: Scaling Transactions through Asymmetric Replication STAR: Scaling Transactions through Asymmetric Replication arxiv:1811.259v2 [cs.db] 2 Feb 219 ABSTRACT Yi Lu MIT CSAIL yilu@csail.mit.edu In this paper, we present STAR, a new distributed in-memory database

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

Technical Report: Increasing Network Resiliency by Optimally Assigning Diverse Variants to Routing Nodes

Technical Report: Increasing Network Resiliency by Optimally Assigning Diverse Variants to Routing Nodes Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 2013 Technical Report: Increasing Network Resiliency by Optimally Assigning Diverse Variants

More information

VMware vsphere Clusters in Security Zones

VMware vsphere Clusters in Security Zones SOLUTION OVERVIEW VMware vsan VMware vsphere Clusters in Security Zones A security zone, also referred to as a DMZ," is a sub-network that is designed to provide tightly controlled connectivity to an organization

More information

Security (and finale) Dan Ports, CSEP 552

Security (and finale) Dan Ports, CSEP 552 Security (and finale) Dan Ports, CSEP 552 Today Security: what if parts of your distributed system are malicious? BFT: state machine replication Bitcoin: peer-to-peer currency Course wrap-up Security Too

More information

SentinelOne Technical Brief

SentinelOne Technical Brief SentinelOne Technical Brief SentinelOne unifies prevention, detection and response in a fundamentally new approach to endpoint protection, driven by machine learning and intelligent automation. By rethinking

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 07 (version 16th May 2006) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information