Replicated State Machine (RSM) via Paxos

Recall: Replicated state machine
- Deterministic operations
- Servers execute operations in the same order

We've seen the primary-backup implementation of RSM
- Primary: orders all operations, replicates them to the backup
- Backup: executes operations in the assigned order

RSM's correctness requirements (generalizing primary/backup):
- all replicas agree on the set of replicas
  * primary/backup uses a fixed set of replicas
- all replicas agree on the set and order of operations
  * primary/backup uses a designated replica (the primary) to accept and order operations

Problems with primary/backup:
- what if a replica fails?
  if a backup fails, the primary can skip it
  what if the primary fails? can B1 become the primary?
  what if the primary did not fail, and there's a network partition between the primary and the backups?
- The heart of the issue:
  can nodes agree on who's the primary and who are the backups in a fault-tolerant manner?
  This is addressed by distributed consensus.

Challenge of consensus:
- a node can fail (but its failure cannot be accurately detected)
- an RPC timeout can mean any of the following:
  * node failure
  * node slow
  * network partition
  * network slow
- also referred to as the "asynchronous network" assumption

Let us simplify the consensus problem:
  3 nodes, S1, S2, S3
  Each node can independently propose a value
  Desired outcome:
  - safety: all agree on the same value
    (that value must be one proposed by some node, as opposed to a silly default)
  - liveness: agreement eventually happens even if one node can fail

Here are some strawman solutions:

Strawman #0:
- every node sends its proposed value to every other node
- every node waits to receive from all other nodes, then chooses the value from the smallest node id

Strawman #1:
- every node sends its proposed value to every other node
- every node chooses the first value seen, ignores the rest

Strawman #2:
- every node sends its proposed value to every other node
- a node accepts the first value seen, rejects the rest
- if a node has received acceptance from "x" other nodes, its value is chosen; it sends a victory message to all others

Strawman #3:
- same as strawman #2, except a node forgets its accepted value after 10 seconds if it has not seen any victory message

The main ideas in Paxos:
1. 3 phases
   - to obtain any previously majority-accepted value
   - to seek majority acceptance
   - to notify everyone of the chosen value
2. multiple rounds
   if a round does not succeed, start another round
3. a majority is required for agreement -- prevents "split brain"
   a key point: any two majorities overlap
   any later majority will share at least one server with any earlier majority,
   so any later majority can find out what an earlier majority decided
   (a quick arithmetic sketch of this overlap follows below)
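A quick arithmetic sketch (in Go, purely illustrative) of why any two majorities must overlap: a quorum is floor(N/2)+1 servers, and two quorums together name more than N servers, so by pigeonhole they share at least one.

  package main

  import "fmt"

  // majority returns the quorum size for n servers: floor(n/2)+1.
  func majority(n int) int { return n/2 + 1 }

  func main() {
      for _, n := range []int{3, 5, 7} {
          q := majority(n)
          // Two quorums of size q cover 2*q > n slots, so by the
          // pigeonhole principle they must share at least one server.
          fmt.Printf("n=%d quorum=%d overlap guaranteed: %v\n", n, q, 2*q > n)
      }
  }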
Paxos sketch:
  each node consists of two logical entities:
    proposer
    acceptor
  each proposer wants to get agreement on its value
    OK to have multiple proposers
  a proposer contacts acceptors, tries to assemble a majority
    if a majority respond, we're done

basic Paxos exchange:
  proposer                     acceptors
    prepare(n)        ->
                      <-  prepare_ok(n, n_a, v_a)
    accept(n, v')     ->
                      <-  accept_ok(n)
    decided(v')       ->

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n, n_a, v_a) from majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state: must persist across reboots
  n_h (highest prepare seen)
  n_a, v_a (highest accept seen)

acceptor's prepare(n) handler:
  if n > n_h
    n_h = n
    reply prepare_ok(n, n_a, v_a)
  else
    reply prepare_reject(n, n_h)

acceptor's accept(n, v) handler:
  if n >= n_h
    n_h = n
    n_a = n
    v_a = v
    reply accept_ok(n)
  else
    reply accept_reject(n, n_h)

why n?  (proposal #, or ballot #)
  to distinguish among multiple rounds, e.g. proposer crashes, simultaneous proposers
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique and roughly follow time
    e.g. n = <time, ID>, where the ID can be the server's IP address

The crucial property:
  if a value was accepted by a majority, any subsequent choice must be the same value
  i.e. the protocol must not change its mind
  maybe a different proposer &c, but the same value!
  this allows us to freely start new rounds after crashes &c

That's why:
  the proposer doesn't send out a value with prepare
  acceptors send back any value they have already accepted
  if there is one, the proposer proposes that value
    to avoid changing an existing choice
  if no value has already been accepted, the proposer can propose any value (e.g. a client request)
  the proposer must get prepare_ok from a majority
    to guarantee intersection with any previous majority
    to guarantee the proposer hears of any previously chosen value

example 1 (normal operation):
  S1, S2, S3, but S3 is dead or slow
  S1 starts a proposal, n=1 v=foo
  S1: p1 a1foo dfoo
  S2: p1 a1foo dfoo
  S3: dead...
  "p1" means S* has successfully prepared(n=1)
  "a1foo" means S* has accepted (n=1, v=foo)
  "dfoo" means S* received decided(v=foo)
  Note Paxos only requires a majority of the servers,
  so we can continue even though S3 was down
  the proposer must not wait forever for any acceptor's response
  What would happen under a network partition?
  i.e. S3 was alive and had a proposed value bar
  S3's prepare would not assemble a majority
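Here is a minimal Go sketch of the acceptor state and handlers above. It is illustrative only: the type and method names are my own, RPC plumbing is omitted, and n_h/n_a/v_a are kept in memory even though, as the notes stress, they must really be persisted before replying.

  package paxos

  import "sync"

  // Proposal numbers must be unique and roughly follow time,
  // e.g. n = <round, server ID>.
  type N struct {
      Round int // increases over time
      ID    int // server id breaks ties, e.g. derived from its address
  }

  func (a N) GT(b N) bool {
      return a.Round > b.Round || (a.Round == b.Round && a.ID > b.ID)
  }
  func (a N) GE(b N) bool { return a == b || a.GT(b) }

  // Acceptor state; n_h, n_a, v_a must be written to stable storage
  // before replying (omitted here).
  type Acceptor struct {
      mu sync.Mutex
      nh N           // n_h: highest prepare seen
      na N           // n_a: highest accept seen
      va interface{} // v_a: value of highest accept (nil if none)
  }

  // Prepare handler: on success, report any previously accepted (n_a, v_a).
  func (ac *Acceptor) Prepare(n N) (ok bool, na N, va interface{}) {
      ac.mu.Lock()
      defer ac.mu.Unlock()
      if n.GT(ac.nh) {
          ac.nh = n
          return true, ac.na, ac.va // prepare_ok(n, n_a, v_a)
      }
      return false, ac.nh, nil // prepare_reject(n, n_h)
  }

  // Accept handler: note the >= comparison, and that n_h is raised here too.
  func (ac *Acceptor) Accept(n N, v interface{}) bool {
      ac.mu.Lock()
      defer ac.mu.Unlock()
      if n.GE(ac.nh) {
          ac.nh = n
          ac.na = n
          ac.va = v
          return true // accept_ok(n)
      }
      return false // accept_reject(n, n_h)
  }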
More examples:
How does Paxos ensure that the following sequence of events can't happen?
What actually happens, and which value is ultimately chosen?
  proposer 1 crashes after sending two accepts
  proposer 2 has a different value in mind
  S1: p1 a1foo
  S2: p1        p2 a2bar
  S3: p1 a1foo  p2 a2bar
  S3's prepare_ok to S2 really included "foo",
  so it should be a2foo, and so there is no problem
the point:
  if the system has already reached agreement, a majority will know the value
  any new majority of prepares will intersect that majority,
  so a subsequent proposer will learn of the already-agreed-on value
  and send it in its accept msgs

example 2 (concurrent proposers):
  S1 starts proposing n=10
  S1 sends out just one accept v=foo
  S3 starts proposing n=11
    but S1 does not receive its proposal
  S3 only has to wait for a majority of prepare responses
  S1: p10 a10foo
  S2: p10 p11
  S3: p10 p11 a11bar
  But S1 and S3 have accepted different values!
  what will happen?
  what will S2 do if it later gets an a10foo accept msg from S1? (reject)
  what will S1 do if it later gets an a11bar accept msg from S3? (ignore)
  what if S3 were to crash at this point (and not restart)?
  how about this:
  S1: p10 a10foo          p12
  S2: p10         p11 a11bar
  S3: p10         p11     p12 a12foo
  has the system agreed to a value at this point?

what's the commit point?
  i.e. exactly when has agreement been reached?
  i.e. at what point would changing the value be a disaster?
  after a majority has the same v_a/n_a? yes -- why is that sufficient?
  sketch:
    suppose a majority has the same v_a/n_a
    acceptors will reject accept() with a lower n
    for any higher n: its prepares must have seen our majority's v_a/n_a (overlap)

why does the proposer need to pick the v_a with the highest n_a?
  S1: p10 a10foo              p12
  S2: p10         p11 a11bar
  S3: p10         p11 a11bar  p12 a12foo??
  n=11 already agreed on bar
  n=12 sees both foo and bar, but must choose bar
  why: if a majority has accepted, then the highest n_a carries the majority-accepted value

why does the prepare handler check that n > n_h?
  responding to every prepare() with prepare_ok() would also be fine
  (but it still needs to update n_h)
  proposers with n < n_h would later be ignored by accept() anyway

why does the accept handler check n >= n_h?
  required to ensure there is a unique majority
  without the n >= n_h check, you could get this bad scenario:
  S1: p1 p2 a1foo
  S2: p1 p2 a1foo a2bar
  S3: p1 p2       a2bar

why does the accept handler update n_h = n?
  required to prevent earlier n's from being accepted
  a node can get accept(n,v) even though it never saw prepare(n)
  without n_h = n, you could get this bad scenario:
  S1: p1 a2bar a1foo p3 a3foo
  S2: p1       p2    p3 a3foo
  S3:          p2 a2bar

Why does an acceptor have to remember n_a, v_a across a crash?
  S1: p1 a1foo
  S2: p1 a1foo reboot p2 a2bar?
  S3: p1              p2 a2bar?
  S2 must remember v_a/n_a across the reboot! on disk
  it might be the only intersection with the new proposer's majority,
  and thus the only evidence that foo was already agreed on

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_h on disk?
  if n_h is not remembered, this could happen:
  S1: p10 a10foo
  S2: p10         p11 reboot a10foo a11bar
  S3:             p11               a11bar
  the n=11 proposer did not see foo, so it proposed its own value (bar)
  but just before that, foo had been chosen (S1 and S2 form a majority for a10foo)!
  b/c S2 forgot n_h=11 and so wrongly accepted a10foo

can Paxos get stuck?
  yes, if there is not a majority that can communicate
  how about if a majority is available?
  it is possible to livelock:
  S1: p1 p2 p3 ...
  S2: p1 p2 p3 ...
  S3: p1 p2 p3 ...

Paxos guarantees:
- safety: all nodes decide on the same value (consensus)
- liveness: live nodes eventually decide, w.h.p., if a majority of nodes are alive and can talk to each other
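Continuing the illustrative Go sketch from above (same hypothetical N type, same assumed package), this is the proposer-side rule the examples keep exercising: among a majority of prepare_ok replies, re-propose the value with the highest n_a; only if no acceptor reports a value may the proposer use its own.

  // (Same package as the acceptor sketch above.)

  // One prepare_ok reply as seen by the proposer.
  type PrepareOK struct {
      Na N           // n_a from that acceptor
      Va interface{} // v_a, or nil if it has accepted nothing
  }

  // chooseValue implements "v' = v_a with highest n_a; choose own v otherwise".
  // oks must come from a majority of acceptors, which guarantees it
  // intersects any majority that may already have accepted a value.
  func chooseValue(oks []PrepareOK, own interface{}) interface{} {
      var best *PrepareOK
      for i := range oks {
          if oks[i].Va == nil {
              continue
          }
          if best == nil || oks[i].Na.GT(best.Na) {
              best = &oks[i]
          }
      }
      if best != nil {
          return best.Va // may already be chosen; must not change it
      }
      return own // nothing accepted yet: free to propose our own value
  }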
----- How to use Paxos to implement RSM ------

RSM: all replicas execute the same set of ops in the same sequence
  0: op0
  1: op1
  2: op2
  3: op3
  ...

Idea: extend Paxos to agree on more than one value.
  One instance ---> one agreement
  Run a Paxos instance to agree on which op should be at a specific seqno.
  One seqno ---> one agreement

Paxos provides consensus on which op is at a given seqno. Not sufficient by itself.
Other invariants are needed:
- A replica should propose an op at seqno s only if all seqno < s have been decided (by Paxos)
  (to preserve linearizability's completion-to-issue order)
- The above serializes operations -- no good. A better invariant:
  delay the reply to the client for seqno s until all seqno < s have been decided.

Counter-example (why some invariant is needed):
  client: P(x=1)  P_ok(x=1)  P(x=2)  P_ok(x=2)
  S1: (s=2)p1 (s=2)a1(x=1) P_ok(x=1)
  S2: (s=2)p1 (s=2)a1(x=1)            (s=1)p1 (s=1)a1(x=2)
  S3:                                 (s=1)p1 (s=1)a1(x=2)
  put(x=1) is decided at seqno 2 and acknowledged before seqno 1 is decided;
  put(x=2) then lands at seqno 1, so the log orders x=2 before x=1
  even though P(x=2) was issued after P_ok(x=1) -- not linearizable

--------- Optimization (practical Paxos) -------

Elect a leader (by some external mechanism; it might be wrong)
  * clients send requests to the leader
  * (hopefully there's only one leader) only the leader proposes
  * one RTT instead of two RTTs:
    the leader prepares once for all instances (s=1-inf), then only sends accepts
  S1: (s=1-inf)p1 (s=1)a1(x=10) (s=1)d(x=10) (s=2)a1(x=11) crash
  S2: (s=1-inf)p1 (s=1)a1(x=10) (s=1)d(x=10) (s=2)a1(x=11)
      (s=1-inf)p2 (reply: (s=1)d(x=10), (s=2)a(x=11), (s=2-inf)nil) (s=2)a2(x=11)
  S3: (s=1-inf)p1                             (s=1-inf)p2           (s=2)a2(x=11)
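A rough Go sketch of the "one Paxos instance per seqno" structure and the delay-the-reply invariant above. Replica, OnDecide, apply, and reply are hypothetical names of my own; the Paxos protocol itself is assumed to live behind OnDecide.

  package rsm

  // Op is one client operation to be placed in the replicated log.
  type Op struct {
      ClientID int64
      Seq      int64
      Body     string
  }

  // Replica holds the log of decided ops, one Paxos instance per seqno.
  type Replica struct {
      decided map[int]Op // seqno -> op that Paxos decided for that slot
      applied int        // highest seqno executed so far
      apply   func(Op)   // execute an op on the local state machine
      reply   func(Op)   // send the result back to the client
  }

  // OnDecide is called when the Paxos instance for seqno reaches agreement.
  func (r *Replica) OnDecide(seqno int, op Op) {
      r.decided[seqno] = op
      r.tryApply()
  }

  // tryApply executes decided ops strictly in seqno order. A client reply
  // for seqno s therefore goes out only after every seqno < s has been
  // decided and applied -- the invariant discussed above.
  func (r *Replica) tryApply() {
      for {
          op, ok := r.decided[r.applied+1]
          if !ok {
              return // hole in the log: wait for that instance to decide
          }
          r.applied++
          r.apply(op)
          r.reply(op)
      }
  }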