6.824 2012 Lecture 16: Megastore

Megastore: Providing Scalable, Highly Available Storage for Interactive
Services, Baker et al., CIDR 2011.

Why are we reading this paper?
  A modern, deployed storage system, used for some big Google services
  Unusual: synchronous WAN replication
  Unusual: ACID transactions on WAN
  Unusual: Paxos for every update operation
  Unusual: not primary/backup, any replica can initiate an update
  Details of optimizations to make Paxos-per-op faster

Who is the customer?
  Paper is not specific
  Consumer-facing web sites
  And Google App Engine -- like EC2
    I.e. they are selling storage as a service
    Not just an internal tool
  Examples: email, collab editing / blogs, social networking

What might the customer want?
  100% available => replication, seamless fail-over
  Never lose data => don't ack until truly durable
  Replicated at multiple data centers, for low latency and availability
  Consistent -- transactions
  High performance

What's a transaction? (the paper's "fully serializable ACID semantics")
  BEGIN
    reads and writes
  END
  Serializable: as if executed one at a time, in some order
    Megastore in fact executes them one at a time
  Serializable implies:
    no intermediate state visible
    no read-modify-write races
    a transaction's reads see data at just one point in time
  Durable (the D in ACID)

How about primary/backup replication, with Paxos to choose the primary?
  Slow -- wide-area (100 ms?) exchange for every operation
  Slow -- one-at-a-time
  Detecting primary crash takes a while
  Fail-over may require backups to compare notes
  Recovering server may require xfer of big state or missed operations
  Sometimes not available even when some replica is alive:
    Paxos prevents operation if no majority
    Paxos prevents operation in a minority partition
  Are these problems with primary/backup fixable?
  Can we expect Megastore to solve them?

Some conventional wisdom about wide-area systems
  Hard to have both consistency and performance
    Consistency requires communication
  Hard to have both consistency and availability
    Can't use a partitioned (but otherwise perfectly good) replica
  Popular solution: relaxed consistency
    E.g. r/w the local replica, send writes to others in the background
    "asynchronous replication"
    Reads may yield stale data
    Multiple-write operations may not be atomic
    Read-modify-write races may yield lost updates
    Programmers can deal with this, but it's not so nice
    We'll see examples in a few weeks

Megastore flouts conventional wisdom
  Wide-area synchronous replication
  Wide-area consistency

Note: this paper has lots of details, and lots omitted
  Ask questions!

What's the basic design?
  Figure 5
  At each data center (== "replica"):
    BigTable cluster
    Application Server + Megastore Library
    Replication Server
    Coordinator
  Data in BigTable is intended to be identical at all replicas
  Probably many Application Servers per data center
    Serving different users and different applications
  Browser web requests may arrive at any replica
    I.e. at the Application Server at any replica
    There is no special "primary" replica
    So there could be concurrent transactions on the same data from multiple replicas
  Transactions can only use data within a single "entity group"
    An entity group is one row, or a set of related rows
    Defined by the application
    E.g. all my e-mail msgs may be in a single entity group
    And yours will be in a different entity group
  Example transaction: move msg 451 from Inbox to Personal (see the sketch below)
  This probably can't be a transaction:
    Deliver a msg to both rtm and yandong
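A sketch of what the application-side code for that example might look like,
in Go, using an invented Megastore-library API (EntityGroup, Begin, Get, Put,
Delete, Commit are made-up names, not the paper's):

  package megasketch

  import "errors"

  // ErrConflict: some other transaction won the next log slot.
  var ErrConflict = errors.New("transaction conflict")

  // Txn is a hypothetical per-entity-group transaction handle.
  type Txn interface {
      Get(key string) (string, error) // read from the local BigTable
      Put(key, value string)          // buffer a write
      Delete(key string)              // buffer a delete
      Commit() error                  // Paxos on the next log entry
  }

  // EntityGroup is a hypothetical handle on one entity group,
  // e.g. all of rtm's mail.
  type EntityGroup interface {
      Begin() Txn
  }

  // moveMessage atomically moves msg 451 from Inbox to Personal.
  // Both rows are in the same entity group, so this can be one transaction.
  // If a concurrent transaction wins the next log slot, re-run from scratch.
  func moveMessage(mail EntityGroup) error {
      for {
          txn := mail.Begin()
          msg, err := txn.Get("Inbox/451")
          if err != nil {
              return err
          }
          txn.Put("Personal/451", msg)
          txn.Delete("Inbox/451")
          err = txn.Commit()
          if err == ErrConflict {
              continue // lost the race for log entry n+1; retry
          }
          return err
      }
  }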
How do transactions work?
  (Simplified plan, based on Section 3.3)
  Each entity group has a log of transactions
    Stored in BigTable, a copy at each replica
    Data in BigTable should == result of playing the log
  Transaction code in the Application Server:
    Find highest log entry # = n
    Read data from local BigTable
    Accumulate writes in temporary storage
    Create a log entry: the set of writes
    Use Paxos to agree that log entry n+1 = our entry
      Result: every replica's log has the same content for entry n+1
      This is the commit point
    Apply writes in the log entry to BigTable data
  Note:
    Commit requires waiting for inter-datacenter msgs
    Only a majority of replicas need to respond
      Non-responders may miss some log entries
      Later transactions will need to repair this
    There might be conflicting transactions
    No primary!

Figure 6 illustrates a log example
  Each column is one replica's log
  Each row is one log entry == one Paxos instance
  A replica's log seems to hold its Paxos state
    E.g. it may have accepted a value that won't be the final value
  The logs may initially differ
  But eventually they must agree

What if there are concurrent transactions?
  Problem: might be a data race, e.g. both do x = x + 1
  Problem: they will both want the same log slot
  Megastore allows one to commit and aborts the others
  Conservatively prohibits concurrency within an entity group
    So it doesn't use traditional DB locking
    Which would allow concurrency if non-overlapping data
  Conflicts are caught during Paxos agreement
    The Application Server will find that some other xaction got log entry n+1
    And the application must re-try the whole transaction

How do reads work?
  Must return the latest data (otherwise not consistent)
  Would like to avoid inter-replica communication
    Ideally read from the local BigTable w/o talking to any other replica
  Problems:
    1. maybe this replica missed some operations
    2. maybe this replica's log has operations but not yet applied
  Solution for #1: per-replica Coordinator Server
    A flag for each entity group, saying whether it is up to date
    A transaction *must* clear the flag if the replica didn't respond to Paxos
  So: (Section 4.6.2, Figure 7; see the sketch below)
    1. ask local coordinator if up to date
       if yes, r = local replica
       if no, r = find a replica that is up to date
    2. find the number of the highest log entry
    3. apply any not-yet-applied log entries in r to r
       may need to ask other replicas if r is missing entries
    4. if r = local replica, tell the local coordinator it's now up to date
    5. read data from r's BigTable

Why is the coordinator needed?
  The coordinator indicates whether the local replica has the latest log entry
  I.e. whether it's enough to just locally apply all known entries
  And, if all known entries are applied, whether it's OK to just read the data

Could we eliminate the coordinator?
  Yes, and it would make the design more robust (as we'll see)
  But reads would have to query all replicas to ensure the most recent data
  Would slow reads down by at least 10x
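A sketch of that read path in Go, with invented types and method names (the
paper doesn't give an API; this just restates steps 1-5 above, and hand-waves
how an up-to-date replica is found):

  package megasketch

  // Replica is a hypothetical handle on one replica's state for one
  // entity group: its copy of the log and its BigTable data.
  type Replica interface {
      HighestLogEntry() int       // highest known log entry number
      Applied() int               // highest log entry applied to BigTable
      FetchEntry(n int) []byte    // get entry n, from this or another replica
      ApplyEntry(n int, e []byte) // play entry n into BigTable
      Read(key string) string     // read from this replica's BigTable
  }

  // Coordinator is the per-replica coordinator: one up-to-date bit
  // per entity group.
  type Coordinator interface {
      UpToDate(group string) bool
      SetUpToDate(group string)
  }

  // currentRead follows steps 1-5 of Section 4.6.2 / Figure 7.
  func currentRead(group, key string, local Replica, coord Coordinator,
      findUpToDateReplica func(group string) Replica) string {

      // 1. Ask the local coordinator whether the local replica has seen
      //    every committed log entry for this entity group.
      r, isLocal := local, true
      if !coord.UpToDate(group) {
          r, isLocal = findUpToDateReplica(group), false
      }

      // 2. Find the number of the highest committed log entry.
      n := r.HighestLogEntry()

      // 3. Apply any log entries r hasn't yet applied to its BigTable;
      //    missing entries may have to be fetched from other replicas.
      for i := r.Applied() + 1; i <= n; i++ {
          r.ApplyEntry(i, r.FetchEntry(i))
      }

      // 4. If we used the local replica, it is now up to date.
      if isLocal {
          coord.SetUpToDate(group)
      }

      // 5. Read from r's BigTable.
      return r.Read(key)
  }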
How does transaction commit work?
  (the paper calls this "write", as in 4.4.2 and 4.6.3)
  Paxos agreement on the content of the next log entry
  Paxos usually takes 2 rounds (prepare, accept)
    They usually just send out accepts, no prepares
  When is a prepare not necessary?
    If you know you have the lowest proposal #
    Then you don't need to learn of any existing proposed value
  Each accepted log entry indicates a "leader" for the next entry
    The leader gets to choose who submits proposal #0 for the next log entry
    The first replica to ask wins that right
    All replicas act as if they had already received the prepare for #0
  Why does that help?
    Common case: repeated transactions from the same replica
    The leader for n+1 is the replica that did transaction n
    So the application server and the leader will be at the same replica:
      the AS can ask for the right to propose #0 with a local msg,
      then just send out accepts to the other replicas
    It doesn't need prepare_acks (with values) since it's proposal #0
  What if there are concurrent commits?
    The leader will give one the right to send accepts for proposal #0
    The other will send prepares for a higher proposal #
    The higher proposal may still win!
    So proposal #0 is not a guarantee of winning
    It just eliminates one round in the common case

Commit ("write") details (Section 4.6.3, Figure 8; see the sketch below)
  1. ask the leader for permission to use proposal #0
  2. if "no", send Paxos prepare msgs
  3. send accepts; go to 2 if no majority
  4. send invalidate to the coordinator of any replica that did not accept
  5. apply the transaction's writes at as many replicas as possible
  6. if you didn't win, return an error; the caller will re-run the xaction
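A sketch of the accept/invalidate/apply part of that write path in Go, with
invented types; it assumes the caller has already been granted proposal #0 by
the leader (step 1), and it omits the fall-back to ordinary Paxos prepares
with a higher proposal number (step 2):

  package megasketch

  // Peer is a hypothetical handle on one replica: a Paxos acceptor for
  // this entity group's log, plus its coordinator and its BigTable.
  type Peer interface {
      // Accept asks the replica to accept proposal number num with value
      // entry for log slot slot; returns true if it accepts.
      Accept(slot, num int, entry []byte) bool
      Invalidate(group string)      // clear coordinator's up-to-date bit
      Apply(slot int, entry []byte) // apply a committed entry to BigTable
  }

  // commitFastPath is steps 3-6 of Section 4.6.3 / Figure 8 for the
  // common case: our entry, proposal #0, no prepare phase needed.
  func commitFastPath(group string, slot int, entry []byte, peers []Peer) bool {
      // 3. Send accepts to all replicas (including the local one).
      accepted := make([]bool, len(peers))
      count := 0
      for i, p := range peers {
          if p.Accept(slot, 0, entry) {
              accepted[i] = true
              count++
          }
      }
      if count <= len(peers)/2 {
          return false // no majority: the real protocol retries with prepares
      }

      for i, p := range peers {
          if accepted[i] {
              // 5. Apply the writes at as many replicas as possible.
              p.Apply(slot, entry)
          } else {
              // 4. Invalidate the coordinator at any replica that did not
              //    accept, so its reads won't trust the local BigTable.
              p.Invalidate(group)
          }
      }
      // 6. Our entry won the slot, so the transaction committed; if it
      //    hadn't, the caller would return an error and re-run the xaction.
      return true
  }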
How does Megastore handle failures?

  Failure case: BigTable at replica R1 is overloaded/broken
    (they view this as relatively common: 4.7, 2nd paragraph)
    R1 won't respond
    Transactions can still commit (as long as a majority respond)
      Need to talk to R1's coordinator (we're assuming it's up and reachable)
    Who drives recovery? not clear
      Reads at R1 will use a different replica
      Maybe the end of 4.5 implies the replication server periodically fixes it

  Failure case: App Server crashes during commit
    (maybe they view this as relatively common)
    Paxos deals with it correctly, depending on whether the commit point was reached
    End of 4.5 says the replication server fixes incomplete Paxos instances
    Reads will cope if it crashed before applying

  Failure case: replica disconnected for a while
    (they view this as rare)
    Won't respond to Paxos -- OK
    But also its coordinator won't respond!
      Which blocks commit (step 4 in 4.6.3)
    4.7 says (I think) that coordinators have leases
      Each must renew its lease at every replica periodically
      If it doesn't/can't:
        Commits can ignore it
        It must mark all its entity groups as "not up to date"
      So a commit will delay for the lease period, then can proceed
      The non-responding replica will have to catch up later on *all* entity groups

  Failure case: one replica persistently slower
    Will others wait for it? Will it ever catch up?
    Section 5 hints that it will, by slowing down the other replicas

Why Paxos per operation?
  The paper claims low-delay handling of failures
    They're worried about a slow/broken BigTable at one replica
  Does Paxos-per-operation help? Yes -- maybe not even a timeout
    Can finish commit as soon as a majority respond to accept
    Only cost: talk to the coordinator
    Application Servers don't have to know there's a problem
    Primary/backup would force the primary to wait for a slow backup
    They claim p/b has slow failover (2.1.1), but don't explain
  Another possible reason for Paxos-per-operation:
    Compatible w/ the BigTable-centric design
    No dedicated servers, as in Harp
    Application Servers directly r/w state in BigTable
    BigTable knows nothing about transactions &c

Does primary/backup have any advantages?
  Might work better after a total replica failure
    E.g. can't reach its coordinator
    Megastore has to re-discover the working majority on every commit
    Primary/backup remembers this from op to op
      (by the current choice of primary and live backups)
  P/B might also avoid Megastore's notification of the coordinator
    after every commit that didn't get the full set of accepts

Performance?
  Section 4.10, Figure 10
  Reads take 10s of milliseconds
  Writes take 100s of milliseconds
  Is that fast or slow?
  How can they support zillions of users at only a few ops/second?
  ...
  Intro says billions of operations per day
    20 billion reads/day / 86,400 sec/day ~= 230,000 reads/second
    3 billion writes/day / 86,400 sec/day ~= 35,000 writes/second

Summary
  Case study of a big production storage system
  Unconventional:
    Wide-area replication AND consistency
    Synchronous replication
    Paxos-per-operation
  Interesting failure model:
    Network is reliable
    Local storage (BigTable) is not very reliable
  Exploits the intended workload:
    Huge # of independent activities
    So individual ops can be slow, but they can still get high throughput
  ...

Mysteries
  How is log space freed up?
  Where are Paxos promises to agree durably stored? In the log?
  What's in a log entry?
  Who executes the Paxos handlers?
  Who applies log entries at remote replicas?
  Who repairs a replica after it's marked "stale" in the coordinator?