Replicated State Machine (RSM) via Paxos

Recall: Replicated state machine
- Deterministic operations
- Servers execute operations in the same order

We've seen the primary-backup implementation of RSM
- Primary: orders all operations, replicates them to the backup
- Backup: executes operations in the assigned order

RSM's correctness requirements (generalizing primary/backup):
- all replicas agree on the set of replicas
  * primary/backup uses a fixed set of replicas
- all replicas agree on the set and order of operations
  * primary/backup uses a designated replica (the primary) to accept and order operations

Problems with primary/backup:
- what if a replica fails?
  if a backup fails, the primary can skip it
  what if the primary fails? can B1 become the primary?
  what if the primary did not fail, and there's a network partition between the primary and the backups?
- The heart of the issue:
  can nodes agree on who's the primary and who are the backups in a fault-tolerant manner?
  This is addressed by distributed consensus.

Challenge of consensus:
- a node can fail (but its failure cannot be accurately detected)
- an RPC timeout can mean any of the following:
  * node failure
  * node slow
  * network partition
  * network slow
- also referred to as the "asynchronous network" assumption

Let us simplify the consensus problem:
  3 nodes, S1, S2, S3
  Each node can independently propose a value
  Desired outcome:
  - safety: all agree on the same value
    (that value must be one proposed by some node, as opposed to a silly default)
  - liveness: agreement eventually happens even if one node can fail

Here are some strawman solutions:

Strawman #0:
- every node sends its proposed value to every other node
- every node waits to receive from all other nodes, then chooses the value from the smallest node id

Strawman #1:
- every node sends its proposed value to every other node
- every node chooses the first value seen, ignores the rest

Strawman #2:
- every node sends its proposed value to every other node
- a node accepts the first value seen, rejects the rest
- if a node has received acceptance from "x" other nodes, its value is chosen; it sends a victory message to all others

Strawman #3:
- same as strawman #2, except a node forgets its accepted value after 10 seconds if it has not seen any victory message

The main ideas in Paxos:
1. 3 phases
   - to obtain any previously majority-accepted value
   - to seek majority acceptance
   - to notify everyone of the chosen value
2. multiple rounds
   if a round does not succeed, start another round
3. a majority is required for agreement -- prevents "split brain"
   a key point: any two majorities overlap
   any later majority will share at least one server with any earlier majority,
   so any later majority can find out what an earlier majority decided
   (a quick arithmetic sketch of this overlap follows below)
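A quick arithmetic sketch (in Go, purely illustrative) of why any two majorities must overlap: a quorum is floor(N/2)+1 servers, and two quorums together name more than N servers, so by pigeonhole they share at least one.

  package main

  import "fmt"

  // majority returns the quorum size for n servers: floor(n/2)+1.
  func majority(n int) int { return n/2 + 1 }

  func main() {
      for _, n := range []int{3, 5, 7} {
          q := majority(n)
          // Two quorums of size q cover 2*q > n slots, so by the
          // pigeonhole principle they must share at least one server.
          fmt.Printf("n=%d quorum=%d overlap guaranteed: %v\n", n, q, 2*q > n)
      }
  }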
Paxos sketch:
  each node consists of two logical entities:
    proposer
    acceptor
  each proposer wants to get agreement on its value
    OK to have multiple proposers
  a proposer contacts acceptors, tries to assemble a majority
    if a majority respond, we're done

basic Paxos exchange:
  proposer                     acceptors
    prepare(n)        ->
                      <-  prepare_ok(n, n_a, v_a)
    accept(n, v')     ->
                      <-  accept_ok(n)
    decided(v')       ->

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n, n_a, v_a) from majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state: must persist across reboots
  n_h (highest prepare seen)
  n_a, v_a (highest accept seen)

acceptor's prepare(n) handler:
  if n > n_h
    n_h = n
    reply prepare_ok(n, n_a, v_a)
  else
    reply prepare_reject(n, n_h)

acceptor's accept(n, v) handler:
  if n >= n_h
    n_h = n
    n_a = n
    v_a = v
    reply accept_ok(n)
  else
    reply accept_reject(n, n_h)

why n?  (proposal #, or ballot #)
  to distinguish among multiple rounds, e.g. proposer crashes, simultaneous proposers
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique and roughly follow time
    e.g. n = <time, ID>, where the ID can be the server's IP address

The crucial property:
  if a value was accepted by a majority, any subsequent choice must be the same value
  i.e. the protocol must not change its mind
  maybe a different proposer &c, but the same value!
  this allows us to freely start new rounds after crashes &c

That's why:
  the proposer doesn't send out a value with prepare
  acceptors send back any value they have already accepted
  if there is one, the proposer proposes that value
    to avoid changing an existing choice
  if no value has already been accepted, the proposer can propose any value (e.g. a client request)
  the proposer must get prepare_ok from a majority
    to guarantee intersection with any previous majority
    to guarantee the proposer hears of any previously chosen value

example 1 (normal operation):
  S1, S2, S3, but S3 is dead or slow
  S1 starts a proposal, n=1 v=foo
  S1: p1 a1foo dfoo
  S2: p1 a1foo dfoo
  S3: dead...
  "p1" means S* has successfully prepared(n=1)
  "a1foo" means S* has accepted (n=1, v=foo)
  "dfoo" means S* received decided(v=foo)
  Note Paxos only requires a majority of the servers,
  so we can continue even though S3 was down
  the proposer must not wait forever for any acceptor's response
  What would happen under a network partition?
  i.e. S3 was alive and had a proposed value bar
  S3's prepare would not assemble a majority
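Here is a minimal Go sketch of the acceptor state and handlers above. It is illustrative only: the type and method names are my own, RPC plumbing is omitted, and n_h/n_a/v_a are kept in memory even though, as the notes stress, they must really be persisted before replying.

  package paxos

  import "sync"

  // Proposal numbers must be unique and roughly follow time,
  // e.g. n = <round, server ID>.
  type N struct {
      Round int // increases over time
      ID    int // server id breaks ties, e.g. derived from its address
  }

  func (a N) GT(b N) bool {
      return a.Round > b.Round || (a.Round == b.Round && a.ID > b.ID)
  }
  func (a N) GE(b N) bool { return a == b || a.GT(b) }

  // Acceptor state; n_h, n_a, v_a must be written to stable storage
  // before replying (omitted here).
  type Acceptor struct {
      mu sync.Mutex
      nh N           // n_h: highest prepare seen
      na N           // n_a: highest accept seen
      va interface{} // v_a: value of highest accept (nil if none)
  }

  // Prepare handler: on success, report any previously accepted (n_a, v_a).
  func (ac *Acceptor) Prepare(n N) (ok bool, na N, va interface{}) {
      ac.mu.Lock()
      defer ac.mu.Unlock()
      if n.GT(ac.nh) {
          ac.nh = n
          return true, ac.na, ac.va // prepare_ok(n, n_a, v_a)
      }
      return false, ac.nh, nil // prepare_reject(n, n_h)
  }

  // Accept handler: note the >= comparison, and that n_h is raised here too.
  func (ac *Acceptor) Accept(n N, v interface{}) bool {
      ac.mu.Lock()
      defer ac.mu.Unlock()
      if n.GE(ac.nh) {
          ac.nh = n
          ac.na = n
          ac.va = v
          return true // accept_ok(n)
      }
      return false // accept_reject(n, n_h)
  }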
More examples:
How does Paxos ensure that the following sequence of events can't happen?
What actually happens, and which value is ultimately chosen?
  proposer 1 crashes after sending two accepts
  proposer 2 has a different value in mind
  S1: p1 a1foo
  S2: p1        p2 a2bar
  S3: p1 a1foo  p2 a2bar
  S3's prepare_ok to S2 really included "foo",
  so it should be a2foo, and so there is no problem
the point:
  if the system has already reached agreement, a majority will know the value
  any new majority of prepares will intersect that majority,
  so a subsequent proposer will learn of the already-agreed-on value
  and send it in its accept msgs

example 2 (concurrent proposers):
  S1 starts proposing n=10
  S1 sends out just one accept v=foo
  S3 starts proposing n=11
    but S1 does not receive its proposal
  S3 only has to wait for a majority of prepare responses
  S1: p10 a10foo
  S2: p10 p11
  S3: p10 p11 a11bar
  But S1 and S3 have accepted different values!
  what will happen?
  what will S2 do if it later gets an a10foo accept msg from S1? (reject)
  what will S1 do if it later gets an a11bar accept msg from S3? (ignore)
  what if S3 were to crash at this point (and not restart)?
  how about this:
  S1: p10 a10foo          p12
  S2: p10         p11 a11bar
  S3: p10         p11     p12 a12foo
  has the system agreed to a value at this point?

what's the commit point?
  i.e. exactly when has agreement been reached?
  i.e. at what point would changing the value be a disaster?
  after a majority has the same v_a/n_a? yes -- why is that sufficient?
  sketch:
    suppose a majority has the same v_a/n_a
    acceptors will reject accept() with a lower n
    for any higher n: its prepares must have seen our majority's v_a/n_a (overlap)

why does the proposer need to pick the v_a with the highest n_a?
  S1: p10 a10foo              p12
  S2: p10         p11 a11bar
  S3: p10         p11 a11bar  p12 a12foo??
  n=11 already agreed on bar
  n=12 sees both foo and bar, but must choose bar
  why: if a majority has accepted, then the highest n_a carries the majority-accepted value

why does the prepare handler check that n > n_h?
  responding to every prepare() with prepare_ok() would also be fine
  (but it still needs to update n_h)
  proposers with n < n_h would later be ignored by accept() anyway

why does the accept handler check n >= n_h?
  required to ensure there is a unique majority
  without the n >= n_h check, you could get this bad scenario:
  S1: p1 p2 a1foo
  S2: p1 p2 a1foo a2bar
  S3: p1 p2       a2bar

why does the accept handler update n_h = n?
  required to prevent earlier n's from being accepted
  a node can get accept(n,v) even though it never saw prepare(n)
  without n_h = n, you could get this bad scenario:
  S1: p1 a2bar a1foo p3 a3foo
  S2: p1       p2    p3 a3foo
  S3:          p2 a2bar

Why does an acceptor have to remember n_a, v_a across a crash?
  S1: p1 a1foo
  S2: p1 a1foo reboot p2 a2bar?
  S3: p1              p2 a2bar?
  S2 must remember v_a/n_a across the reboot! on disk
  it might be the only intersection with the new proposer's majority,
  and thus the only evidence that foo was already agreed on

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_h on disk?
  if n_h is not remembered, this could happen:
  S1: p10 a10foo
  S2: p10         p11 reboot a10foo a11bar
  S3:             p11               a11bar
  the n=11 proposer did not see foo, so it proposed its own value (bar)
  but just before that, foo had been chosen (S1 and S2 form a majority for a10foo)!
  b/c S2 forgot n_h=11 and so wrongly accepted a10foo

can Paxos get stuck?
  yes, if there is not a majority that can communicate
  how about if a majority is available?
  it is possible to livelock:
  S1: p1 p2 p3 ...
  S2: p1 p2 p3 ...
  S3: p1 p2 p3 ...

Paxos guarantees:
- safety: all nodes decide on the same value (consensus)
- liveness: live nodes eventually decide, w.h.p., if a majority of nodes are alive and can talk to each other
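Continuing the illustrative Go sketch from above (same hypothetical N type, same assumed package), this is the proposer-side rule the examples keep exercising: among a majority of prepare_ok replies, re-propose the value with the highest n_a; only if no acceptor reports a value may the proposer use its own.

  // (Same package as the acceptor sketch above.)

  // One prepare_ok reply as seen by the proposer.
  type PrepareOK struct {
      Na N           // n_a from that acceptor
      Va interface{} // v_a, or nil if it has accepted nothing
  }

  // chooseValue implements "v' = v_a with highest n_a; choose own v otherwise".
  // oks must come from a majority of acceptors, which guarantees it
  // intersects any majority that may already have accepted a value.
  func chooseValue(oks []PrepareOK, own interface{}) interface{} {
      var best *PrepareOK
      for i := range oks {
          if oks[i].Va == nil {
              continue
          }
          if best == nil || oks[i].Na.GT(best.Na) {
              best = &oks[i]
          }
      }
      if best != nil {
          return best.Va // may already be chosen; must not change it
      }
      return own // nothing accepted yet: free to propose our own value
  }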
----- How to use Paxos to implement RSM ------

RSM: all replicas execute the same set of ops in the same sequence
  0: op0
  1: op1
  2: op2
  3: op3
  ...

Idea: extend Paxos to agree on more than one value.
  One instance ---> one agreement
  Run a Paxos instance to agree on which op should be at a specific seqno.
  One seqno ---> one agreement

Paxos provides consensus on which op is at a given seqno. Not sufficient by itself.
Other invariants are needed:
- A replica should propose an op at seqno s only if all seqno < s have been decided (by Paxos)
  (to preserve linearizability's completion-to-issue order)
- The above serializes operations -- no good. A better invariant:
  delay the reply to the client for seqno s until all seqno < s have been decided.

Counter-example (why some invariant is needed):
  client: P(x=1)  P_ok(x=1)  P(x=2)  P_ok(x=2)
  S1: (s=2)p1 (s=2)a1(x=1) P_ok(x=1)
  S2: (s=2)p1 (s=2)a1(x=1)            (s=1)p1 (s=1)a1(x=2)
  S3:                                 (s=1)p1 (s=1)a1(x=2)
  put(x=1) is decided at seqno 2 and acknowledged before seqno 1 is decided;
  put(x=2) then lands at seqno 1, so the log orders x=2 before x=1
  even though P(x=2) was issued after P_ok(x=1) -- not linearizable

--------- Optimization (practical Paxos) -------

Elect a leader (by some external mechanism; it might be wrong)
  * clients send requests to the leader
  * (hopefully there's only one leader) only the leader proposes
  * one RTT instead of two RTTs:
    the leader prepares once for all instances (s=1-inf), then only sends accepts
  S1: (s=1-inf)p1 (s=1)a1(x=10) (s=1)d(x=10) (s=2)a1(x=11) crash
  S2: (s=1-inf)p1 (s=1)a1(x=10) (s=1)d(x=10) (s=2)a1(x=11)
      (s=1-inf)p2 (reply: (s=1)d(x=10), (s=2)a(x=11), (s=2-inf)nil) (s=2)a2(x=11)
  S3: (s=1-inf)p1                             (s=1-inf)p2           (s=2)a2(x=11)
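A rough Go sketch of the "one Paxos instance per seqno" structure and the delay-the-reply invariant above. Replica, OnDecide, apply, and reply are hypothetical names of my own; the Paxos protocol itself is assumed to live behind OnDecide.

  package rsm

  // Op is one client operation to be placed in the replicated log.
  type Op struct {
      ClientID int64
      Seq      int64
      Body     string
  }

  // Replica holds the log of decided ops, one Paxos instance per seqno.
  type Replica struct {
      decided map[int]Op // seqno -> op that Paxos decided for that slot
      applied int        // highest seqno executed so far
      apply   func(Op)   // execute an op on the local state machine
      reply   func(Op)   // send the result back to the client
  }

  // OnDecide is called when the Paxos instance for seqno reaches agreement.
  func (r *Replica) OnDecide(seqno int, op Op) {
      r.decided[seqno] = op
      r.tryApply()
  }

  // tryApply executes decided ops strictly in seqno order. A client reply
  // for seqno s therefore goes out only after every seqno < s has been
  // decided and applied -- the invariant discussed above.
  func (r *Replica) tryApply() {
      for {
          op, ok := r.decided[r.applied+1]
          if !ok {
              return // hole in the log: wait for that instance to decide
          }
          r.applied++
          r.apply(op)
          r.reply(op)
      }
  }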