Notes partially borrowed from Robert Morris, 6.824.

The design of a practical system for fault-tolerant virtual machines

The motivation: 
  Software that can continue despite failures!
  - e.g. machines crashes
  - not network failures, malicious software etc.

The main approach
  - Replicated virtual machine
  - *Two* servers, each keeps state 
    If one replica fails, the other can continue

Why replicate at the VM level?
  - how about replicate at the storage level.

Example usage: fault-tolerant MapReduce master
  lab 1 master is a "single point of failure"
  can we have two masters, in case one fails?
  [diagram: M1, M2, workers]
  state:
    worker list
    which jobs done
    which workers idle
    TCP connection state
    program counter

  How does M/R replicate master's state? 
    - do nothing
    - checkpoints state to GFS

Two main approaches:
  State transfer
    "Primary" replica executes the service
    Primary sends [new] state to backups
  Replicated state machine
    All replicas execute all operations
    If same start state,
      same operations,
      same order,
      deterministic,
      then same end state

State transfer is simpler
  But state may be large, slow to transfer
  VM-FT uses replicated state machine

Replicated state machine is more efficient
  operations tend to be smaller than data
  But complex to get right
    e.g. order on multi-core, determinism

What are the operations for replicated state machine?
  K/V put and get?
  x86 instructions? <---- VM-FT!!!
  Pros: 
  - Completely transparent to applications and clients
  - Hard to achieve good performance (compared to w/o replication)
  
The design of a Practical System for Fault-Tolerant Virtual Machines
Scales, Nelson, and Venkitachalam, SIGOPS OSR Vol 44, No 4, Dec 2010

Overview
  [diagram: app, O/S, VM-FT underneath]
  two machines, primary and backup; and other machines
  two networks: clients-to-servers, logging channel
  shared-disk for persistent storage
  back-up in "lock step" with primary
    primary sends all inputs to backup
    outputs of backup are dropped
  heart beats between primary and backup
    if primary fails, start backup executing!

Challenges:
  1. How to make backup an exact replica of primary
    What operations must send to backup?
      Clock interrupts?
    How to deal with non-determinism?
      E.g., Interrupt must be delivered at backup at same instruction as at primary
  2.  How to make failover consistent?
      if primary fails just before or after sending response to client
      might a client request be lost? executed twice?
    When does the primary send a response to a client?
  
Challenge 1: deterministic replay
  Goal: make x86 platform deterministic
    idea: use hypervisor to make virtual x86 platform deterministic
    two phases: logging and replay
  Log all hardware events into a log
    clock interrupts, network interrupts, i/o interrupts, etc.
    for non-deterministic instructions, record additional info
      e.g., log the value of the time stamp register
      on replay: return the value from the log instead of the actual register
  Replay: delivery inputs in the same order at the same instructions
    if during recording delivered clock interrupt at nth instruction executed
    during replay also delivers the clock interrupt at the nth instruction
  Given a log of events, deterministic replay recreates VM
    hypervisor delivers first event
    lets the machine execute to the next event
      using special hardware registers to stop the processor at the right instruction
    the virtual x86 executes identical during replay as during recording
      OS runs identical
      Applications runs identical
      -> Same outputs will be generated on replay as during recording
  Limitation: cannot handle multicore processors x86
    Too expensive to record and replay the correct interleaving of instructions

Example:
  primary receives network interrupt
    hypervisor forwards interrupt plus data to backup
    hypervisor delivers network interrupt to OS kernel
    OS kernel runs
    kernel delivers packet to server
    server/kernel write response to network card
    hypervisor gets control and puts response on the wire
  backup receives log entries
    backup delivers network interrupt at the same point in instruction stream
    hypervisor delivers interrupt to its OS kernel
    ...
    hypervisor gets control
      does *not* put response on the wire

    hypervisor ignores local clock interrupts
      it gets clock interrupts from primary
  primary and backup get same inputs, end up in same state

Challenge 2 solution: FT protocol
  Primary delays any external output (e.g. a packet) until the backup acks
    Log entry for each output operation
    Primary sends output after backup acked receiving output operation
  Performance optimization:
    primary keeps executing passed output operations
    buffers output until backup acknowledges

Q: Why send output events to backup and delay output until backup has acked?
  Consider: don't log output events. Log only input events.
  Primary:
    process network input
    produces output
    primary fails
  Backup cannot reproduce this sequence correctly:
    last log entry is: process network input
      deliver it to kernel
    backup goes "live"
      it becomes the primary
    hardware interrupt (e.g., clock)
      deliver it to the kernel  (since backup is live now)
    the network input results in output
      did the primary produce output before hardware interrupt or before?
      backup doesn't know
        it doesn't have a log entry for the interrupt
	it doesn't have a log entry for output
      important because clock interrupt may have influenced output
  By sending start output event to backup, backup can order events correctly
    clock interrupt before or after start output event

Note that both primary and backup can produce the same output event
  --> Authors claim producing output twice is *not* a problem
    if output is network packet, client must be able to handle duplicate packets
    if output is write to disk, write the same data twice to the same location
      but there cannot any other writes in between (they would have been in the log)
      so, should be ok
    
Q: What happens when primary fails after receiving network input but before
sending a corresponding log entry to backup?
  A: network input. service relies on client to retry.
    this is reasonable because network could have lost request packet
  A: disk input. hypervisor restarts pending disk I/O

Challenge 2 solution: shared disk
    Backup replays through last log entry
    Backup atomically test-and-set variable on disk
      If set, primary is still alive. Commit suicide
      If not set, primary is dead. Become primary
      If primary, create new backup from checkpoint
        Using VMotion
      
Shared storage is single-point of failure
    If shared storage is down, service is down
  Ok for paper's setting, but for geo-replicate services
    Cannot survive earthquakes

Q: How long is service unavailable after primary fails?
  Detect failure
  Execute log entries
    VM-FT slows down primary if backup gets too far behind
  Write shared storage
  Switch to "live"

Performance (table 1)
  FT/Non-FT: impressive!
    little slow down
  Why oracle/ ms-sql generate more logging b/w?

Summary:
  Primary-backup replication
    VM-FT: clean example
  How to get better performance?
    Primary-back replication using higher-level replicated state machines
      key/value operations such as put and get