Transparent Recovery with Chain Replication
Robert Burgess, Ken Birman, Robert Broberg, Rick Payne, Robbert van Renesse
October 26, 2009
Motivation
- There is a connection between us (the server) and them (the client)
- We keep some application state... and so do they
- The server fails... and is revived, but two things are still missing
- Look! A persistent store: the server can checkpoint, and on revival, recover
- That recovers the application state; what recovers the connection state?
Why bother?
- As soon as the client tries to send... RST!
- The client thinks "I'll just try again!" - but it isn't that simple
The client
- Humans see "Connection Reset"
- When should the client retry? When should the client give up?
- What session maps to a new connection?
- Some protocols would need re-authentication
- Some protocols already respond actively: BGP assumes the link is lost, and resync is slow
What are the possibilities?
The network stack
- Has the state and the logic
- No copies or context switches
- Fork the network stack
- Redundant logging
- Synchronous replication
- End-to-end: only use the kernel for efficiency
Network stack wrappers [FT-TCP]
- Still some kernel advantages
- No change to server or client
- Two kernel modules
- Must interpose on socket calls
- Synchronous replication
The server
- User-level networking: server state includes the connection state
- Can't leverage the OS
- Significant server changes
A proxy [CRAFT, I-TCP]
- Splice (spoof) separate connections
- Replicate for fault-tolerance: state machine replication
- Recover by reconnecting
- Connections aren't really connected
A man in the middle [Morris, ST-TCP]
- Little or no overhead
- Guesswork
- No control over the server
- What about recovery?
rtcp
- Little or no overhead
- No guesswork
- Can control the server
- Chain replication
- Machine-independent: replicas don't grok TCP
- But what about recovery?
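The chain replication the slide relies on can be sketched as follows (class and function names here are hypothetical illustrations, not the librtcp API): an update enters at the head, each replica stores it before forwarding, and only the tail acknowledges, so an acknowledged update is on every replica.

```python
class Replica:
    """One link in a chain; stores an update before forwarding it."""
    def __init__(self):
        self.log = []         # locally stored updates
        self.successor = None # next replica toward the tail

    def handle(self, update):
        self.log.append(update)               # store locally first
        if self.successor is not None:        # not the tail: forward
            return self.successor.handle(update)
        return ("ack", update)                # tail: acknowledge

def make_chain(n):
    replicas = [Replica() for _ in range(n)]
    for a, b in zip(replicas, replicas[1:]):
        a.successor = b
    return replicas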
Recovery
- How we recover determines what we must replicate
- If we replicate state machines, and we control the stack, we can fail over TCBs
Recovery
- We must restart the connection, with an unchanged stack
- A recovery process notifies the chain and makes a new connection
- Packets are modified to spoof the real client
- TCP handshakes normally (the chain maintains the spoofing)
Recovery
- The connection is replayed from the checkpoint
- Client traffic can continue; it is ignored until the replay catches up
- Connection recovered!
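The replay step above might look like this sketch (simplified to plain byte offsets rather than real TCP sequence arithmetic; `recover` is a hypothetical name, not part of librtcp):

```python
def recover(logged, next_off, live_segments):
    """logged: client bytes saved since the last checkpoint.
    next_off: byte offset just past the end of `logged`.
    live_segments: (offset, data) pairs arriving during recovery."""
    delivered = [logged]              # replay the saved bytes first
    for off, data in live_segments:
        if off < next_off:            # already covered by the replay...
            continue                  # ...so live traffic is ignored
        if off == next_off:           # replay has caught up: deliver
            delivered.append(data)
            next_off += len(data)
    return b"".join(delivered)
```

Live client segments that overlap the replay are dropped; once the replay catches up, new data flows through normally.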
Replaying client data
- Replay from the beginning
  - No need for the server to tell recovered connections from new ones
  - Server must be deterministic
  - Memory-intensive in the common case; slow recovery
- Replay from an explicit checkpoint
  - Can be requested from the server or by an administrator
  - May need to distinguish recovered connections (getpeername)
  - Bounded data to store and replay
- One step further: hold ACKs until checkpointed
  - No need to store or replay packets at all!
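The "hold ACKs until checkpointed" idea reduces to one comparison (a sketch with an assumed helper name, not the authors' code): never acknowledge client bytes beyond the last server checkpoint, so the client's own retransmission machinery covers anything a failed server could lose.

```python
def ack_to_send(client_sent_upto, checkpointed_upto):
    # Acknowledge only bytes already covered by a server checkpoint;
    # the client's normal retransmission logic resends the rest, so
    # the server never acks data it cannot recover after a crash.
    return min(client_sent_upto, checkpointed_upto)
```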
It can't be that simple...
- Initial sequence numbers
  - We can pick the correct client sequence number, but the server picks its own
  - We'll have to patch up all future packets: change the sequence, recompute the checksum
- Fragmentation
  - Assume that endpoints use a reasonable MSS, or that the driver program handles reassembly
- Selective acknowledgements
  - SACKs are advisory only: let them flow normally, and do nothing to recover those packets if lost
- MD5 security
  - Adds an additional checksum with a symmetric key; the administrator must provide key information
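The sequence patch-up might look like this sketch (hypothetical helper, not librtcp): add a constant delta to the 32-bit sequence field of every outgoing segment, then recompute the TCP checksum over the pseudo-header and segment.

```python
import struct

def ones_complement_sum(data):
    """Fold a byte string into a 16-bit ones'-complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s

def patch_seq(tcp_segment, delta, src_ip, dst_ip):
    """Shift the sequence number by delta and fix the checksum."""
    seq = struct.unpack("!I", tcp_segment[4:8])[0]
    seg = bytearray(tcp_segment)
    seg[4:8] = struct.pack("!I", (seq + delta) & 0xFFFFFFFF)
    seg[16:18] = b"\x00\x00"   # zero the checksum field before summing
    # IPv4 pseudo-header: src, dst, zero byte, protocol 6 (TCP), length
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(seg))
    csum = ~ones_complement_sum(pseudo + bytes(seg)) & 0xFFFF
    seg[16:18] = struct.pack("!H", csum)
    return bytes(seg)
```

A valid TCP segment's words, summed together with its pseudo-header in ones' complement, give 0xFFFF, so correctness is easy to check after patching.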
Implementation
- librtcp
  - Server or administrator functions: set up the server (with key information), send checkpoints to replicas, recover the connection
  - Replica functions: process a packet (agnostic of transport)
- Potentially many driver programs; the current rtcp program uses Linux netfilter QUEUE:
  1. Pull a packet off the queue
  2. Let the library process the packet
  3. Permit delivery
- In the future, Feather-Weight Pipes!
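The three-step driver loop above can be sketched like this (the queue class here is a stub standing in for the kernel's netfilter QUEUE interface, which the real rtcp program uses; none of these names come from librtcp):

```python
class StubQueue:
    """Stand-in for the kernel packet queue used by the driver."""
    def __init__(self, pkts):
        self.pkts = list(pkts)
        self.verdicts = []
    def get(self):
        return self.pkts.pop(0) if self.pkts else None
    def set_verdict(self, pkt, verdict):
        self.verdicts.append((pkt, verdict))

def driver_loop(queue, process_packet):
    while True:
        pkt = queue.get()                  # 1. pull a packet off the queue
        if pkt is None:
            break
        process_packet(pkt)                # 2. let the library process it
        queue.set_verdict(pkt, "ACCEPT")   # 3. permit delivery
```

Keeping the loop this thin is what lets the library stay agnostic of how packets are captured: another driver could feed it packets from a different source without changing the processing code.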
Implementation status
- Replicas log and forward packets
- Connections are tracked and kept up to date
- No checkpointing or recovery yet
- No MD5 option handling yet
Conclusion
- rtcp: chain replication enables cheap consistency
- Leverages existing stacks; platform-independent; can run on any machine in the network
- Perhaps minor changes to the server (or an automatic administrator) to checkpoint and recover
- Simple, simple, simple, simple. Fast?