11/28/00 -- Replication and Consistency

Big picture over the next few days:
  We want to replicate data for reliability.
  How do we manage replicas consistently?
  Can we replicate transparently to applications?
  Can we replicate and maintain high performance?

Example application: storing mailboxes.
  Each mailbox is a list of numbered messages.
  Operations by clients:
    M = Read(Num)
    Num = Append(M)
    Delete(Num)

Can anything go wrong in a simple single-server implementation?
  Serializability: what if two clients call Append() at the same time?
    Implement with locks.
  Atomicity: what if the system crashes in the middle of an Append() or Delete()?
    Implement with transactions and logs.
  Durability: what if the system crashes just after Append() returns?
    Implement by flushing data and logs carefully.
  You learned how to do this in 6.033.

Why do programmers like this model?
  Hides the existence of concurrency.
  Hides failure recovery details.
  General tool, the same for many applications.

Why isn't a single-server DB good enough?
  Single point of failure.
  Want to replicate data on multiple machines.
  May help read (or even write) performance as well as availability.

Straw man replicated database:
  Imagine 2 servers, every mailbox replicated on both.
  Client sends read operations to either server.
  Client sends update operations to both; waits for one to reply.
  Use Miguel's time diagrams.

What can go wrong?
  1. Suppose a client does Num = Append(M); Delete(Num).
     The updates may occur in different orders on the two servers.
     Can maybe fix this for a single client by waiting for all ACKs.
     But what about two clients and concurrent Append()s?
     The servers end up with differing copies.
  2. Network partitions.
     Different updates proceed in the two halves.
     Again, the "replicas" no longer replicate each other.

How do we know these behaviors are wrong?
  They could not have happened in a single-copy system.
  Our task is to emulate a single copy with replicas,
    so that simple applications work correctly.

Can fix the partition problem with voting:
  Use 2n+1 replicas if you want to survive n failures.
  Only allow operations in a partition with >= n+1 members.
  There can be at most one such partition.

Can fix the order problem with a primary copy:
  One primary; clients send operations to it.
  The primary imposes an order on concurrent operations.
  The primary tells the slaves about operations, each with a number.
  The slaves perform the operation, then send an ACK to the primary.

Second straw man: is this enough?
  Can the primary just wait for ACKs, then respond to the client?
  No! What if it turns out there are fewer than n slaves?
    Then the primary should abort the operation and send an error to the client.
    But some of the slaves have already performed it.
  So slaves need to defer the actual update until the primary confirms.

2-phase commit protocol:
  1: The primary sends updates to the slaves.
     Slaves append the update information to a log, but don't yet perform it.
     Slaves ACK the first phase to the primary.
     The primary waits for n replies (so the op is at n+1 replicas, counting the primary).
     The primary replies "YES" to the client.
  2: In the background, the primary tells the slaves that the commit happened.
     Slaves update the real DB from the log entry.
  (A primary-side code sketch of this protocol appears below.)

What if the primary fails before sending the client the ACK?
  I.e. while sending phase 1 msgs or collecting slave ACKs.
  If some slave got the phase 1 message, the operation can be re-started.
  But it's OK to just abort the operation -- the client hasn't seen an ACK.

What if the primary fails after ACKing the client?
  But before sending phase 2 commit messages to all slaves.
  The operation *must* complete because the client has seen an ACK.
  The new primary can ask the remaining replicas.
  If n+1 saw (and ACKed) the phase 1 message, the new primary can safely commit.
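To make the normal case concrete, here is a minimal sketch in Go of the primary's
side of the two-phase protocol above. It is not part of the original notes: the
names (Op, Replica, Primary, okReplica) are invented for illustration, the
"network" is just method calls, and a real primary would also need timeouts,
retransmission, and the fail-over handling discussed above.

    // primary.go -- sketch of the primary's side of the two-phase protocol.
    package main

    import (
        "errors"
        "fmt"
    )

    // Op is one client update (e.g. an Append or Delete on a mailbox),
    // stamped with the order chosen by the primary.
    type Op struct {
        Seq  int
        Body string
    }

    // Replica is the primary's view of one slave.
    type Replica interface {
        Prepare(op Op, commitPoint int) bool // phase 1: log the op, then ACK
        Commit(commitPoint int)              // phase 2: the commit point advanced
    }

    // Primary numbers operations and runs the two phases.
    type Primary struct {
        slaves      []Replica
        nextSeq     int
        commitPoint int // highest Seq known to be committed
    }

    // Update handles one client operation. It succeeds only once the op is
    // logged at a majority: the primary plus n of its 2n slaves.
    func (p *Primary) Update(body string) error {
        p.nextSeq++
        op := Op{Seq: p.nextSeq, Body: body}

        // Phase 1: ask every slave to log the op (but not apply it yet).
        acks := 0
        for _, s := range p.slaves {
            if s.Prepare(op, p.commitPoint) {
                acks++
            }
        }

        // Need n ACKs so that n+1 replicas (counting the primary) hold the op.
        if acks < len(p.slaves)/2 {
            return errors.New("too few slaves ACKed; operation aborted")
        }

        // Safe to answer the client now; the op is recoverable after a crash.
        p.commitPoint = op.Seq

        // Phase 2, in the background: tell the slaves the commit point moved.
        go func(cp int) {
            for _, s := range p.slaves {
                s.Commit(cp)
            }
        }(p.commitPoint)
        return nil
    }

    // okReplica is a stand-in slave that ACKs everything, just for the demo.
    type okReplica struct{}

    func (okReplica) Prepare(Op, int) bool { return true }
    func (okReplica) Commit(int)           {}

    func main() {
        p := &Primary{slaves: []Replica{okReplica{}, okReplica{}}}
        fmt.Println(p.Update("Append(M)")) // prints <nil>: both slaves ACKed
    }

The design point matches the notes: the primary replies to the client as soon as
n slaves have the op in their logs, and pushes the commit point out lazily.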
What if a slave fails?
  Doesn't matter -- can just keep going as long as more than half survive.

What about concurrent operations?
  The primary numbers them and allows them to proceed.
  The primary and slaves keep a log of all operations.
  A slave only ACKs a phase 1 msg if it has seen all prior phase 1 msgs.
  The primary only sends out a commit if it has committed all previous ones.
    Otherwise a reboot could lose a later op but commit an earlier one.
  The primary log looks like:
    old committed operations
      <-- Commit Point (CP)
    uncommitted operations

Reads:
  Can't just send a read to any replica.
    Must make sure it's in the majority partition.
    Otherwise the read may miss a committed write in the other half.
  A read must reflect all committed updates.
  So clients have to send reads to the primary.
  A read does not wait for prior uncommitted writes to commit,
    since those calls have not yet returned to the application.

Slave log:
  If a replica ACKs phase 1:
    It can't yet write the DB.
    But it has promised to do so when asked,
      since the primary may already have ACKed the client.
    A slave should not ACK, reboot, and change its mind.
    So it must have a stable log, in the same order as the primary's log.
  When is the real on-disk DB written (from the log)?
    The primary sends its current CP along with every message.
    Slaves remember the largest CP seen.
    Real DB entries are written only when the CP passes the log entry.
    Updates happen in the background, in log order.
  (A slave-side code sketch appears at the end of these notes.)

As-yet unresolved issues:
  Transactions that involve multiple DB operations.
    How to preserve serializability?
  What happens if a slave fails? Need to bring it back up to date.
  What happens if the primary fails?
    How to agree on the choice of a new primary?
    How to reconstruct the primary's state?

What kind of performance are we likely to get?
  Every operation involves all the servers.
  So it's likely to be *slower* than a single server.
  Can we replicate and still get high performance?
    Yes -- Professor Kaashoek will tell us how.
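As a companion to the primary-side sketch above, here is a minimal Go sketch of
the slave-side log and commit point (CP) handling described in the "Slave log"
section. The names (SlaveLog, advance, etc.) are again invented, and an
in-memory slice stands in for what would have to be a stable, on-disk log.

    // slave.go -- sketch of the slave's log and commit-point handling.
    package main

    import "fmt"

    // Op mirrors the primary's numbered operation.
    type Op struct {
        Seq  int
        Body string
    }

    // SlaveLog holds logged-but-possibly-uncommitted ops, in the primary's order.
    type SlaveLog struct {
        log     []Op     // entries with Seq 1..len(log); no holes allowed
        cp      int      // largest commit point heard from the primary
        applied int      // highest Seq actually written to the real DB
        db      []string // stand-in for the real on-disk DB
    }

    // Prepare is the phase 1 handler: append to the (notionally stable) log
    // and ACK -- but only if every earlier op has already been seen, so the
    // log stays hole-free and in the primary's order.
    func (s *SlaveLog) Prepare(op Op, commitPoint int) bool {
        if op.Seq != len(s.log)+1 {
            return false // missing an earlier op; refuse to ACK
        }
        s.log = append(s.log, op)
        s.advance(commitPoint)
        return true
    }

    // Commit is the phase 2 handler: the primary's commit point moved.
    func (s *SlaveLog) Commit(commitPoint int) {
        s.advance(commitPoint)
    }

    // advance remembers the largest CP seen and, in log order, applies to the
    // real DB every log entry that the CP has passed.
    func (s *SlaveLog) advance(commitPoint int) {
        if commitPoint > s.cp {
            s.cp = commitPoint
        }
        for s.applied < s.cp && s.applied < len(s.log) {
            s.db = append(s.db, s.log[s.applied].Body) // "write the real DB"
            s.applied++
        }
    }

    func main() {
        s := &SlaveLog{}
        fmt.Println(s.Prepare(Op{Seq: 1, Body: "Append(M)"}, 0)) // true: logged, not applied
        fmt.Println(s.db)                                        // []
        s.Commit(1)                                              // CP passes entry 1
        fmt.Println(s.db)                                        // [Append(M)]
    }

Note that the slave ACKs before the op is committed: it has promised to apply
the op later, which is why the log must survive a reboot in a real system.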