Replicated State Machine (RSM) via Paxos

Recall: we have implemented a primary/backup storage system.
Draw diagram:
         --> B1
       P
         --> B2
      / \
    C1   C2
  Primary: orders all operations, replicates them to the backups.
  Backup: executes operations in the assigned order.

RSM (a generalization):
  - operations can be arbitrary *deterministic* operations
    PUT, ADD_TO_SET, APPEND_TO_LIST, etc.

RSM's correctness requirements (generalizing primary/backup):
  - all replicas agree on the set of replicas
    * primary/backup uses a fixed set of replicas
  - all replicas agree on the set and order of operations
    * primary/backup uses a designated replica (the primary) to accept
      and order operations

Problems with primary/backup:
  - what if a replica fails?
    if a backup fails, the primary can skip it
    what if the primary fails? can B1 become the primary?
    what if the primary did not fail, and there's a network partition
      between the primary and the backups?
  - The heart of the issue:
    can nodes agree on who's the primary and who are the backups,
    in a fault-tolerant manner?
    This is addressed by distributed consensus.

Challenges of consensus:
  - an RPC timeout cannot distinguish node failure from network partition
  - there are no timing assumptions

Let us simplify the consensus problem:
  3 nodes: S1, S2, S3
  Each node can independently propose a value.
  Desired outcome:
  - safety: all agree on the same value (and that value must be one
    proposed by some node, as opposed to a silly default)
  - liveness: agreement eventually happens, as long as a majority of
    nodes can talk to each other

Here are some strawman solutions:

Strawman #1:
  - every node asks every other node for its proposed value, and puts
    it in a set
  - every node waits until it has heard back from a majority, then
    chooses the value proposed by the smallest node (based on node ID)
  (problem: S1 hears back from {S1,S2} and picks S1's value; S2 hears
   back from {S2,S3} and picks S2's value -- different majorities,
   different choices)

Strawman #2:
  - every node sends its proposed value to every other node
  - a node accepts the first value it sees, and rejects the rest
  - if a node has received acceptance from "x" other nodes, it chooses
    its value and tells the others about it
  (problem: with x = majority, votes can split three ways and no value
   ever reaches a majority -- the protocol gets stuck)

Strawman #3:
  - if a node has not received acceptance from a majority of nodes,
    repeat strawman #2
  - a node accepts the value with the highest timestamp, instead of
    the first value seen
  (problem: a majority may already have accepted some value in an
   earlier attempt; a later attempt with a higher timestamp can
   overwrite it, so the agreed-on value can change after the fact)

The main ideas in Paxos:
  1. 3 phases
     - to obtain any previously majority-accepted value
     - to seek majority acceptance
     - to notify everyone of the chosen value
  2. multiple rounds
     if a round does not succeed, start another round
  3. a majority is required for agreement -- prevents "split brain"
     a key point: any two majorities overlap
     any later majority will share at least one server with any
       earlier majority
     so any later majority can find out what an earlier majority decided

Paxos sketch:
  each node consists of three logical entities:
    proposer
    acceptor
    learner
  each proposer wants to get agreement on its value
    OK to have multiple proposers
  proposer contacts acceptors, tries to assemble a majority
    if a majority respond, we're done

basic Paxos exchange:
  proposer                acceptors
    prepare(n)       ->
                     <-   prepare_ok(n, n_a, v_a)
    accept(n, v')    ->
                     <-   accept_ok(n)
    decided(v')      ->

why n?
  to distinguish among multiple rounds, e.g. proposer crashes,
    simultaneous proposers
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique and roughly follow time
    n = <time, server ID>; the ID can be the server's IP address

The crucial property:
  if a value was accepted by a majority, any subsequent choice must be
    the same value
  i.e. the protocol must not change its mind
    maybe a different proposer &c, but same value!
  this is what allows us to freely start new rounds after crashes &c
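To make the numbering and message shapes concrete, here is a minimal Go
sketch (the type and field names are my own, not from the handout):

  package paxos

  // N is a proposal number: unique and roughly time-ordered.
  type N struct {
      Counter int // chosen higher than any counter seen so far
      Server  int // proposer's server ID (e.g. from its IP address)
  }

  // Less reports whether a < b, comparing lexicographically; ties on
  // Counter are broken by server ID, so no two servers ever generate
  // the same N.
  func (a N) Less(b N) bool {
      if a.Counter != b.Counter {
          return a.Counter < b.Counter
      }
      return a.Server < b.Server
  }

  // The four message shapes of the basic exchange. Note that Prepare
  // carries no value; PrepareOK reports back whatever the acceptor
  // has already accepted (zero values if nothing).
  type Prepare struct{ N N }
  type PrepareOK struct {
      N  N      // echoes the prepare's n
      Na N      // n of the highest accept seen, if any
      Va string // value of that accept
  }
  type Accept struct {
      N N
      V string
  }
  type AcceptOK struct{ N N }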
this property is why:
  the proposer doesn't send out a value with prepare
  acceptors send back any value they have already accepted
  if there is one, the proposer proposes that value,
    to avoid changing an existing choice
  if no value has already been accepted, the proposer can propose
    any value (e.g. a client request)
  the proposer must get prepare_ok from a majority,
    to guarantee intersection with any previous majority,
    to guarantee the proposer hears of any previously chosen value

now the protocol -- see the handout
  (a Go transcription of both sides appears after the examples below)

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n, n_a, v_a) from majority:
    v' = v_a with highest n_a (if any); otherwise choose own v
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state:
  must persist across reboots
  n_h        (highest prepare seen)
  n_a, v_a   (highest accept seen)

acceptor's prepare(n) handler:
  if n > n_h
    n_h = n
    reply prepare_ok(n, n_a, v_a)
  else
    reply prepare_notok(n, n_h)

acceptor's accept(n, v) handler:
  if n >= n_h
    n_h = n
    n_a = n
    v_a = v
    reply accept_ok(n)
  else
    reply accept_notok(n, n_h)

example 1 (normal operation):
  S1, S2, S3, but S3 is dead or slow
  S1 starts a proposal, n=1 v=A

  S1: p1  a1vA  dA
  S2: p1  a1vA  dA
  S3: dead...

  "p1" means S* has successfully prepared (n=1)
  "a1vA" means S* has accepted (n=1, v=A)
  "dA" means S* has received decided(v=A)

  Note Paxos only requires a majority of the servers,
    so we can continue even though S3 was down.
  The proposer must not wait forever for any acceptor's response.

  What would happen under a network partition?
  I.e. what if S3 was alive and had a proposed value B?
    S3's prepare would not assemble a majority.

More examples:
  How does Paxos ensure that the following sequence of events can't
  happen? What actually happens, and which value is ultimately chosen?
    proposer 1 crashes after sending two accepts
    proposer 2 has a different value in mind
  S1: p1  a1foo
  S2: p1         p2  a2bar
  S3: p1  a1foo  p2  a2bar
  S3's prepare_ok to S2 really included "foo",
    so the last column should be a2foo, and there is no problem.
  the point:
    if the system has already reached agreement, a majority knows the value
    any new majority of prepares will intersect that majority
    so a subsequent proposer will learn of the already-agreed-on value
      and send it in its accept msgs

example 2 (concurrent proposers):
  S1 starts proposing, n=10
  S1 sends out just one accept, v=foo
  S3 starts proposing, n=11,
    but S1 does not receive its prepare
  S3 only has to wait for a majority of prepare responses
  S1: p10  a10foo
  S2: p10          p11
  S3: p10          p11  a11bar
  But S1 and S3 have accepted different values!
  what will happen?
    what will S2 do if it later gets the a10foo accept msg from S1?
      (reject: n=10 < n_h=11)
    what will S1 do if it later gets the a11bar accept msg from S3?
      (accept it: n=11 >= n_h=10, overwriting its v_a)
    what if S3 were to crash at this point (and not restart)?

  how about this:
  S1: p10  a10v10              p12
  S2: p10          p11  a11v11
  S3: p10          p11         p12  a12v10
  has the system agreed to a value at this point?

what's the commit point?
  i.e. exactly when has agreement been reached?
  i.e. at what point would changing the value be a disaster?
  after a majority has the same v_a/n_a? yes -- why sufficient?
  sketch:
    suppose a majority has the same v_a/n_a
    acceptors will reject accept() with a lower n
    for any higher n: its prepares must have intersected our majority
      (overlap), and so seen our v_a/n_a
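Here is the acceptor pseudocode from the handout section above,
transcribed into Go, continuing the earlier sketch (a real acceptor
must write n_h, n_a, v_a to disk before replying -- see the reboot
scenarios below; here they live only in memory):

  // Acceptor state; all three fields must persist across reboots.
  type Acceptor struct {
      Nh N      // n_h: highest prepare seen
      Na N      // n_a: highest accept seen (zero value if none)
      Va string // v_a: value of the highest accept seen
  }

  // HandlePrepare is the prepare(n) handler.
  func (a *Acceptor) HandlePrepare(m Prepare) (PrepareOK, bool) {
      if a.Nh.Less(m.N) { // n > n_h
          a.Nh = m.N
          return PrepareOK{N: m.N, Na: a.Na, Va: a.Va}, true
      }
      return PrepareOK{}, false // prepare_notok(n, n_h)
  }

  // HandleAccept is the accept(n, v) handler. Note the check is
  // n >= n_h, and that n_h is updated here as well.
  func (a *Acceptor) HandleAccept(m Accept) (AcceptOK, bool) {
      if !m.N.Less(a.Nh) { // n >= n_h
          a.Nh = m.N
          a.Na = m.N
          a.Va = m.V
          return AcceptOK{N: m.N}, true
      }
      return AcceptOK{}, false // accept_notok(n, n_h)
  }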
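And the proposer side, again as a sketch: calls are synchronous and
in-process for brevity, whereas a real proposer issues RPCs
concurrently and does not wait forever for any one acceptor:

  // RunRound executes one proposal round for value v, returning the
  // value it got decided (possibly an earlier round's value, not v).
  func RunRound(n N, v string, acceptors []*Acceptor) (string, bool) {
      majority := len(acceptors)/2 + 1

      // Phase 1: send prepare(n), collect prepare_ok replies.
      var oks []PrepareOK
      for _, acc := range acceptors {
          if reply, ok := acc.HandlePrepare(Prepare{N: n}); ok {
              oks = append(oks, reply)
          }
      }
      if len(oks) < majority {
          return "", false // preempted; retry with a higher n
      }

      // v' = v_a with the highest n_a among the replies;
      // if no acceptor reported an accept, use our own v.
      vPrime := v
      var best N // zero value: nothing accepted yet
      for _, reply := range oks {
          if best.Less(reply.Na) {
              best = reply.Na
              vPrime = reply.Va
          }
      }

      // Phase 2: send accept(n, v'), count accept_ok replies.
      count := 0
      for _, acc := range acceptors {
          if _, ok := acc.HandleAccept(Accept{N: n, V: vPrime}); ok {
              count++
          }
      }
      if count < majority {
          return "", false
      }
      // Phase 3: decided(v') would be broadcast to the learners here.
      return vPrime, true
  }

The choice of v' in phase 1 is exactly what the next question examines.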
why does the proposer need to pick the v_a with the highest n_a?
  S1: p10  a10vA             p12
  S2: p10         p11  a11vB
  S3: p10         p11  a11vB p12  a12v??
  n=11 already agreed on vB
  n=12 sees both vA and vB, but must choose vB
  why: if a majority has accepted, then the highest n_a contains the
    majority-accepted value

why does the prepare handler check that n > n_h?
  responding to every prepare() with prepare_ok() would also be fine
    (but the handler still needs to update n_h)
  proposers with n < n_h would later be ignored by accept() anyway

why does the accept handler check n >= n_h?
  required to ensure there is a unique majority
  w/o the n >= n_h check, you could get this bad scenario:
  S1: p1  p2  a1vA
  S2: p1  p2  a1vA  a2vB
  S3: p1  p2        a2vB

why does the accept handler update n_h = n?
  required to prevent earlier n's from being accepted
  a node can get accept(n,v) even though it never saw prepare(n)
  without n_h = n, you could get this bad scenario:
  S1: p1      a2vB  a1vA  p3  a3vA
  S2: p1  p2              p3  a3vA
  S3:     p2  a2vB

what if an acceptor crashes after receiving accept?
  S1: p1  a1v1
  S2: p1  a1v1  reboot  p2  a2v?
  S3: p1                p2  a2v?
  S2 must remember v_a/n_a across the reboot! on disk
    it might be the only intersection with the new proposer's majority,
    and thus the only evidence that v1 was already agreed on

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_h on disk?
  if n_h is not remembered, this could happen:
  S1: p10  a10v10
  S2: p10          p11  reboot  a10v10  a11v11
  S3:              p11                  a11v11
  11's proposer did not see value v10, so 11 proposed its own value,
    but just before that, v10 had been chosen!
    b/c S2 did not remember (via n_h) to ignore a10v10

can Paxos get stuck?
  yes, if there is not a majority that can communicate
  how about if a majority is available?
    it is possible to livelock:
    S1: p1  p2  p3 ...
    S2: p1  p2  p3 ...
    S3: p1  p2  p3 ...

---
Parts of these notes are due to Robert Morris.