Lecture 3: Primary/Backup Replication
Notes adapted from Robert Morris'

Fault tolerance
  we'd like a service that continues despite failures!
  available: still usable despite [some class of] failures
  correct: acts just like a single server to clients
  very hard!
  very useful!

Need a failure model: what will we try to cope with?
  Independent fail-stop computer failure
  Site-wide power failure (and eventual reboot)
  (Network partition)
  No common bugs, no malice

Core idea: replication
  *Two* servers (or more)
  Each replica keeps the state needed for the service
  If one replica fails, the others can continue

Example: fault-tolerant MapReduce master
  lab 1 workers are already fault-tolerant, but the master is not
    master is a "single point of failure"
  can we have two masters, in case one fails?
  [diagram: M1, M2, workers]
  state: worker list
         which jobs are done
         which workers are idle
         TCP connection state
         program counter

Big questions:
  What state to replicate?
  How to keep the replicas' state identical?
  When to cut over to the backup?
  Are anomalies visible at cut-over?
  How to repair / re-integrate?

Two main approaches:
  * State transfer
    "Primary" replica executes the service
    Primary periodically sends [new] state to the backups
    Draw on board:
      Primary              Backup
      op1 -> state'
      op2 -> state''
      op3 -> state'''
             ---------->
  * Replicated state machine
    All replicas execute all operations
    If same start state,
       same operations,
       same order,
       deterministic,
    then same end state
      Primary              Backup
      op1 ---------->

Choose a replication method for:
  * Replicate a virtual machine
    Asynchronous state transfer.
  * Replicate a key/value store
    Replicated state machine.
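The replicated-state-machine idea can be sketched in a few lines of Go. This is a minimal illustration, not the lab's code: the `Op`, `Replica`, and `Apply` names are invented, and forwarding is a plain function call standing in for an RPC. The point is only that two replicas that start empty and apply the same deterministic operations in the same order end in the same state.

```go
package main

import "fmt"

// Op is one deterministic operation on the key/value state machine.
type Op struct {
	Kind string // "Put" or "Append"
	Key  string
	Val  string
}

// Replica holds the state machine's state: a simple key/value map.
type Replica struct {
	data map[string]string
}

func NewReplica() *Replica {
	return &Replica{data: map[string]string{}}
}

// Apply executes one operation deterministically.
func (r *Replica) Apply(op Op) {
	switch op.Kind {
	case "Put":
		r.data[op.Key] = op.Val
	case "Append":
		r.data[op.Key] += op.Val
	}
}

func main() {
	primary, backup := NewReplica(), NewReplica()
	ops := []Op{
		{"Put", "x", "a"},
		{"Append", "x", "b"},
	}
	// The primary executes each op and "forwards" it to the backup;
	// in the lab the forwarding is an RPC, here it's a function call.
	for _, op := range ops {
		primary.Apply(op)
		backup.Apply(op)
	}
	fmt.Println(primary.data["x"], backup.data["x"]) // prints: ab ab
}
```

Note that the order matters: if the backup applied the Append before the Put, its state would diverge, which is why real replicated-state-machine protocols number operations.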
PRIMARY-BACKUP REPLICATION IN LAB 2

outline:
  simple key/value database
    Get(k), Put(k, v), Append(k, v)
  primary and backup
  replicate by having the primary send each operation to the backup
  tolerate network problems, including partition
    either keep going, correctly,
    or suspend operations until the network is repaired
  allow replacement of failed servers
  you implement essentially all of this

"view server" decides who primary (p) and backup (b) are
  main goal: avoid "split brain" -- disagreement about who the primary is
  clients and servers ask the view server
    they don't make independent decisions
  repair: view server can make an "idle" server the backup
    after the old backup becomes primary
    the primary initializes the new backup's state

Invariants:
  1. only one primary at a time!
  2. the primary must have the latest state!
  we will work out some rules to ensure these

view server maintains a sequence of "views"
  view #, primary, backup
  0:  --  --
  1:  S1  --
  2:  S1  S2
  3:  S2  --
  4:  S2  S3

view server monitors server liveness
  each server periodically sends a Ping RPC
  "dead" if it missed N Pings in a row
  "live" after a single Ping
  if more than two servers are Pinging, the extras are "idle" servers

view change rules:
  if the primary is dead,
    new view with the previous backup as primary
  if the backup is dead, or there is no backup,
    new view with a previously idle server as backup
  OK to have a view with just a primary, and no backup
    but -- if an idle server is available, make it the backup

Q: does the new primary always have an up-to-date replica of the state?
   only promote the previous backup
   i.e. don't make an idle server the primary

Q: can more than one server think it is primary?
   1: S1, S2
      net broken, so the view server thinks S1 is dead, but it's alive
   2: S2, --
   now S1 is alive and not aware of view #2, so S1 still thinks it is primary
   AND S2 is alive and thinks it is primary
   => split brain, no good

how to ensure only one server acts as primary?
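The view-change rules above can be sketched as a small transition function. This is an illustrative sketch, not the lab's actual view-server API: `View`, `nextView`, and the `live`/`idle` parameters are invented names. It captures the key decisions: a dead primary is replaced only by the previous backup (never an idle server, which might lack state), and a missing or dead backup is filled from the idle servers.

```go
package main

import "fmt"

// View names a primary and backup for a numbered configuration.
type View struct {
	Num     int
	Primary string
	Backup  string
}

// nextView applies the view-change rules given which servers are live:
//   - if the primary is dead, the previous backup becomes primary
//     (an idle server is never promoted directly to primary, so the
//     new primary always has up-to-date state)
//   - if the backup is dead or missing, a live idle server fills in
// It returns the current view unchanged if no change is needed.
func nextView(cur View, live map[string]bool, idle []string) View {
	v := cur
	changed := false
	if v.Primary != "" && !live[v.Primary] {
		v.Primary, v.Backup = v.Backup, ""
		changed = true
	}
	if v.Backup != "" && !live[v.Backup] {
		v.Backup = ""
		changed = true
	}
	if v.Backup == "" {
		for _, s := range idle {
			if live[s] && s != v.Primary {
				v.Backup = s
				changed = true
				break
			}
		}
	}
	if changed {
		v.Num = cur.Num + 1
	}
	return v
}

func main() {
	// View 2: S1 primary, S2 backup. S1 misses its Pings;
	// S3 is idle and live, so view 3 is S2 primary, S3 backup.
	v := View{Num: 2, Primary: "S1", Backup: "S2"}
	live := map[string]bool{"S2": true, "S3": true}
	v = nextView(v, live, []string{"S3"})
	fmt.Println(v.Num, v.Primary, v.Backup) // prints: 3 S2 S3
}
```

One detail the sketch omits: the real view server must not move to a new view until the current primary has acknowledged the current view, or the "latest state" invariant can be violated.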
  even though more than one may *think* it is primary
  "acts as" == executes and responds to client requests

the basic idea:
  1: S1 S2
  2: S2 --
  S1 still thinks it is primary
  but S1 must forward ops to S2
  S2 thinks S2 is primary
  so S2 must reject S1's forwarded ops

the rules:
  1. the primary in view i must have been the primary or backup in view i-1
  2. if you think you are primary, you must wait for the backup
     to accept each request
  3. if you think you are not the backup, reject forwarded requests
  4. if you think you are not the primary, reject direct client requests

so:
  before S2 hears about view #2
    S1 can process ops from clients, and S2 will accept forwarded requests
    S2 will reject ops from clients who have heard about view #2
  after S2 hears about view #2
    if S1 receives a client request, it will forward it, and S2 will reject it
      so S1 can no longer act as primary
    S1 will return an error to the client,
      the client will ask the view server for the new view,
      and the client will re-send to S2
  the true moment of switch-over occurs when S2 hears about view #2

how can a new backup get the state? e.g. all the keys and values
  if S2 is the backup in view i, but was not in view i-1,
    S2 should ask the primary to transfer the complete state

rule for state transfer:
  every operation (Put/Get/Append) must be either before or after the state transfer
  == state transfer must be atomic w.r.t. operations
  either the op is before, and the transferred state reflects the op,
  or the op is after, the transferred state doesn't reflect the op,
    and the primary forwards the op after the state

Q: does the primary need to forward Get()s to the backup?
   after all, Get() doesn't change anything, so why does the backup need to know?
   and the extra RPC costs time

Q: how could we make primary-only Get()s work?

Q: are there cases when the lab 2 protocol cannot make forward progress?
   The view service fails
   The primary fails before the new backup gets the state
   We will start fixing those in lab 3
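Rules 2-4 can be illustrated with a small sketch. This is not the lab's RPC interface: `Server`, `Forwarded`, and `Put` are invented names, and the forwarding RPC is a direct method call. It shows how a stale primary is fenced off: S1 still thinks it is primary, but rule 2 forces it to forward every op, and rule 3 lets S2 (which has heard about the new view) refuse, so S1 cannot complete any client request.

```go
package main

import (
	"errors"
	"fmt"
)

// View names a primary and backup for a numbered configuration.
type View struct {
	Num     int
	Primary string
	Backup  string
}

// Server tracks the view it currently believes in; me is its own name.
type Server struct {
	me   string
	view View
	data map[string]string
}

// Forwarded implements rule 3: a server that does not think it is
// the backup rejects forwarded operations.
func (s *Server) Forwarded(key, val string) error {
	if s.view.Backup != s.me {
		return errors.New("not backup: rejecting forwarded op")
	}
	s.data[key] = val
	return nil
}

// Put implements rules 2 and 4: only a server that thinks it is
// primary accepts client requests, and it must hear back from the
// backup before applying the op and replying to the client.
func (s *Server) Put(backup *Server, key, val string) error {
	if s.view.Primary != s.me {
		return errors.New("not primary: rejecting client op")
	}
	if s.view.Backup != "" {
		if err := backup.Forwarded(key, val); err != nil {
			// The backup refused: someone else is primary now,
			// so we must not act as primary either.
			return err
		}
	}
	s.data[key] = val
	return nil
}

func main() {
	// S1 is stuck in view 1 (S1 primary, S2 backup); S2 has learned
	// view 2, in which S2 is primary with no backup.
	s1 := &Server{me: "S1", view: View{1, "S1", "S2"}, data: map[string]string{}}
	s2 := &Server{me: "S2", view: View{2, "S2", ""}, data: map[string]string{}}
	// S2 rejects the forwarded op, so S1 cannot act as primary.
	fmt.Println(s1.Put(s2, "x", "1"))
}
```

The client sees S1's error, asks the view server for the current view, and retries at S2, which accepts the op directly; that is the switch-over described above.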