6.824 2012 Lecture 16: Megastore

Megastore: Providing Scalable, Highly Available Storage for Interactive
Services, Baker et al., CIDR 2011.

Why are we reading this paper?
  A modern, deployed storage system, used for some big Google services
  Unusual: synchronous WAN replication
  Unusual: ACID transactions on WAN
  Unusual: Paxos for every update operation
  Unusual: not primary/backup, any replica can initiate an update
  Details of optimizations to make Paxos-per-op faster

Who is the customer?
  Paper is not specific
  Consumer-facing web sites
  And Google App Engine -- like EC2
    I.e. they are selling storage as a service
    Not just an internal tool
  Examples: email, collab editing / blogs, social networking

What might the customer want?
  100% available => replication, seamless fail-over
  Never lose data => don't ack until truly durable
  Replicated at multiple data centers, for low latency and availability
  Consistent -- transactions
  High performance

What's a transaction? (the paper's "fully serializable ACID semantics")
  BEGIN
    reads and writes
  END
  Serializable: as if executed one at a time, in some order
    Megastore in fact executes them one at a time
  Serializable implies:
    no intermediate state visible
    no read-modify-write races
    a transaction's reads see data at just one point in time
  Durable (the D in ACID)

How about primary/backup replication, with Paxos to choose the primary?
  Slow -- wide-area (100 ms?) exchange for every operation
  Slow -- one-at-a-time
  Detecting primary crash takes a while
  Fail-over may require backups to compare notes
  Recovering server may require xfer of big state or missed operations
  Sometimes not available even when some replica is alive:
    Paxos prevents operation if no majority
    Paxos prevents operation in a minority partition
  Are these problems with primary/backup fixable?
  Can we expect Megastore to solve them?

Some conventional wisdom about wide-area systems
  Hard to have both consistency and performance
    Consistency requires communication
  Hard to have both consistency and availability
    Can't use a partitioned (but otherwise perfectly good) replica
  Popular solution: relaxed consistency
    E.g. r/w the local replica, send writes to others in the background
    "asynchronous replication"
    Reads may yield stale data
    Multiple-write operations may not be atomic
    Read-modify-write races may yield lost updates
    Programmers can deal with this, but it's not so nice
    We'll see examples in a few weeks

Megastore flouts conventional wisdom
  Wide-area synchronous replication
  Wide-area consistency

Note: this paper has lots of details, and lots omitted
  Ask questions!

What's the basic design?
  Figure 5
  At each data center (== "replica"):
    BigTable cluster
    Application Server + Megastore Library
    Replication Server
    Coordinator
  Data in BigTable is intended to be identical at all replicas
  Probably many Application Servers per data center
    Serving different users and different applications
  Browser web requests may arrive at any replica
    I.e. at the Application Server at any replica
    There is no special "primary" replica
    So there could be concurrent transactions on the same data from multiple replicas
  Transactions can only use data within a single "entity group"
    An entity group is one row, or a set of related rows
    Defined by the application
    E.g. all my e-mail msgs may be in a single entity group
    And yours will be in a different entity group
  Example transaction: move msg 451 from Inbox to Personal (see the sketch below)
  This probably can't be a transaction:
    Deliver a msg to both rtm and yandong
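A sketch of what the application-side code for that example might look like,
in Go, using an invented Megastore-library API (EntityGroup, Begin, Get, Put,
Delete, Commit are made-up names, not the paper's):

  package megasketch

  import "errors"

  // ErrConflict: some other transaction won the next log slot.
  var ErrConflict = errors.New("transaction conflict")

  // Txn is a hypothetical per-entity-group transaction handle.
  type Txn interface {
      Get(key string) (string, error) // read from the local BigTable
      Put(key, value string)          // buffer a write
      Delete(key string)              // buffer a delete
      Commit() error                  // Paxos on the next log entry
  }

  // EntityGroup is a hypothetical handle on one entity group,
  // e.g. all of rtm's mail.
  type EntityGroup interface {
      Begin() Txn
  }

  // moveMessage atomically moves msg 451 from Inbox to Personal.
  // Both rows are in the same entity group, so this can be one transaction.
  // If a concurrent transaction wins the next log slot, re-run from scratch.
  func moveMessage(mail EntityGroup) error {
      for {
          txn := mail.Begin()
          msg, err := txn.Get("Inbox/451")
          if err != nil {
              return err
          }
          txn.Put("Personal/451", msg)
          txn.Delete("Inbox/451")
          err = txn.Commit()
          if err == ErrConflict {
              continue // lost the race for log entry n+1; retry
          }
          return err
      }
  }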
How do transactions work?
  (Simplified plan, based on Section 3.3)
  Each entity group has a log of transactions
    Stored in BigTable, a copy at each replica
    Data in BigTable should == result of playing the log
  Transaction code in the Application Server:
    Find highest log entry # = n
    Read data from local BigTable
    Accumulate writes in temporary storage
    Create a log entry: the set of writes
    Use Paxos to agree that log entry n+1 = our entry
      Result: every replica's log has the same content for entry n+1
      This is the commit point
    Apply writes in the log entry to BigTable data
  Note:
    Commit requires waiting for inter-datacenter msgs
    Only a majority of replicas need to respond
      Non-responders may miss some log entries
      Later transactions will need to repair this
    There might be conflicting transactions
    No primary!

Figure 6 illustrates a log example
  Each column is one replica's log
  Each row is one log entry == one Paxos instance
  A replica's log seems to hold its Paxos state
    E.g. it may have accepted a value that won't be the final value
  The logs may initially differ
  But eventually they must agree

What if there are concurrent transactions?
  Problem: might be a data race, e.g. both do x = x + 1
  Problem: they will both want the same log slot
  Megastore allows one to commit and aborts the others
  Conservatively prohibits concurrency within an entity group
    So it doesn't use traditional DB locking
    Which would allow concurrency if non-overlapping data
  Conflicts are caught during Paxos agreement
    The Application Server will find that some other xaction got log entry n+1
    And the application must re-try the whole transaction

How do reads work?
  Must return the latest data (otherwise not consistent)
  Would like to avoid inter-replica communication
    Ideally read from the local BigTable w/o talking to any other replica
  Problems:
    1. maybe this replica missed some operations
    2. maybe this replica's log has operations but not yet applied
  Solution for #1: per-replica Coordinator Server
    A flag for each entity group, saying whether it is up to date
    A transaction *must* clear the flag if the replica didn't respond to Paxos
  So: (Section 4.6.2, Figure 7; see the sketch below)
    1. ask local coordinator if up to date
       if yes, r = local replica
       if no, r = find a replica that is up to date
    2. find the number of the highest log entry
    3. apply any not-yet-applied log entries in r to r
       may need to ask other replicas if r is missing entries
    4. if r = local replica, tell the local coordinator it's now up to date
    5. read data from r's BigTable

Why is the coordinator needed?
  The coordinator indicates whether the local replica has the latest log entry
  I.e. whether it's enough to just locally apply all known entries
  And, if all known entries are applied, whether it's OK to just read the data

Could we eliminate the coordinator?
  Yes, and it would make the design more robust (as we'll see)
  But reads would have to query all replicas to ensure the most recent data
  Would slow reads down by at least 10x
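A sketch of that read path in Go, with invented types and method names (the
paper doesn't give an API; this just restates steps 1-5 above, and hand-waves
how an up-to-date replica is found):

  package megasketch

  // Replica is a hypothetical handle on one replica's state for one
  // entity group: its copy of the log and its BigTable data.
  type Replica interface {
      HighestLogEntry() int       // highest known log entry number
      Applied() int               // highest log entry applied to BigTable
      FetchEntry(n int) []byte    // get entry n, from this or another replica
      ApplyEntry(n int, e []byte) // play entry n into BigTable
      Read(key string) string     // read from this replica's BigTable
  }

  // Coordinator is the per-replica coordinator: one up-to-date bit
  // per entity group.
  type Coordinator interface {
      UpToDate(group string) bool
      SetUpToDate(group string)
  }

  // currentRead follows steps 1-5 of Section 4.6.2 / Figure 7.
  func currentRead(group, key string, local Replica, coord Coordinator,
      findUpToDateReplica func(group string) Replica) string {

      // 1. Ask the local coordinator whether the local replica has seen
      //    every committed log entry for this entity group.
      r, isLocal := local, true
      if !coord.UpToDate(group) {
          r, isLocal = findUpToDateReplica(group), false
      }

      // 2. Find the number of the highest committed log entry.
      n := r.HighestLogEntry()

      // 3. Apply any log entries r hasn't yet applied to its BigTable;
      //    missing entries may have to be fetched from other replicas.
      for i := r.Applied() + 1; i <= n; i++ {
          r.ApplyEntry(i, r.FetchEntry(i))
      }

      // 4. If we used the local replica, it is now up to date.
      if isLocal {
          coord.SetUpToDate(group)
      }

      // 5. Read from r's BigTable.
      return r.Read(key)
  }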
How does transaction commit work?
  (the paper calls this "write", as in 4.4.2 and 4.6.3)
  Paxos agreement on the content of the next log entry
  Paxos usually takes 2 rounds (prepare, accept)
    They usually just send out accepts, no prepares
  When is a prepare not necessary?
    If you know you have the lowest proposal #
    Then you don't need to learn of any existing proposed value
  Each accepted log entry indicates a "leader" for the next entry
    The leader gets to choose who submits proposal #0 for the next log entry
    The first replica to ask wins that right
    All replicas act as if they had already received the prepare for #0
  Why does that help?
    Common case: repeated transactions from the same replica
    The leader for n+1 is the replica that did transaction n
    So the application server and the leader will be at the same replica:
      the AS can ask for the right to propose #0 with a local msg,
      then just send out accepts to the other replicas
    It doesn't need prepare_acks (with values) since it's proposal #0
  What if there are concurrent commits?
    The leader will give one the right to send accepts for proposal #0
    The other will send prepares for a higher proposal #
    The higher proposal may still win!
    So proposal #0 is not a guarantee of winning
    It just eliminates one round in the common case

Commit ("write") details (Section 4.6.3, Figure 8; see the sketch below)
  1. ask the leader for permission to use proposal #0
  2. if "no", send Paxos prepare msgs
  3. send accepts; go to 2 if no majority
  4. send invalidate to the coordinator of any replica that did not accept
  5. apply the transaction's writes at as many replicas as possible
  6. if you didn't win, return an error; the caller will re-run the xaction
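A sketch of the accept/invalidate/apply part of that write path in Go, with
invented types; it assumes the caller has already been granted proposal #0 by
the leader (step 1), and it omits the fall-back to ordinary Paxos prepares
with a higher proposal number (step 2):

  package megasketch

  // Peer is a hypothetical handle on one replica: a Paxos acceptor for
  // this entity group's log, plus its coordinator and its BigTable.
  type Peer interface {
      // Accept asks the replica to accept proposal number num with value
      // entry for log slot slot; returns true if it accepts.
      Accept(slot, num int, entry []byte) bool
      Invalidate(group string)      // clear coordinator's up-to-date bit
      Apply(slot int, entry []byte) // apply a committed entry to BigTable
  }

  // commitFastPath is steps 3-6 of Section 4.6.3 / Figure 8 for the
  // common case: our entry, proposal #0, no prepare phase needed.
  func commitFastPath(group string, slot int, entry []byte, peers []Peer) bool {
      // 3. Send accepts to all replicas (including the local one).
      accepted := make([]bool, len(peers))
      count := 0
      for i, p := range peers {
          if p.Accept(slot, 0, entry) {
              accepted[i] = true
              count++
          }
      }
      if count <= len(peers)/2 {
          return false // no majority: the real protocol retries with prepares
      }

      for i, p := range peers {
          if accepted[i] {
              // 5. Apply the writes at as many replicas as possible.
              p.Apply(slot, entry)
          } else {
              // 4. Invalidate the coordinator at any replica that did not
              //    accept, so its reads won't trust the local BigTable.
              p.Invalidate(group)
          }
      }
      // 6. Our entry won the slot, so the transaction committed; if it
      //    hadn't, the caller would return an error and re-run the xaction.
      return true
  }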
How does Megastore handle failures?

  Failure case: BigTable at replica R1 is overloaded/broken
    (they view this as relatively common: 4.7, 2nd paragraph)
    R1 won't respond
    Transactions can still commit (as long as a majority respond)
      Need to talk to R1's coordinator (we're assuming it's up and reachable)
    Who drives recovery? not clear
      Reads at R1 will use a different replica
      Maybe the end of 4.5 implies the replication server periodically fixes it

  Failure case: App Server crashes during commit
    (maybe they view this as relatively common)
    Paxos deals with it correctly, depending on whether the commit point was reached
    End of 4.5 says the replication server fixes incomplete Paxos instances
    Reads will cope if it crashed before applying

  Failure case: replica disconnected for a while
    (they view this as rare)
    Won't respond to Paxos -- OK
    But also its coordinator won't respond!
      Which blocks commit (step 4 in 4.6.3)
    4.7 says (I think) that coordinators have leases
      Each must renew its lease at every replica periodically
      If it doesn't/can't:
        Commits can ignore it
        It must mark all its entity groups as "not up to date"
      So a commit will delay for the lease period, then can proceed
      The non-responding replica will have to catch up later on *all* entity groups

  Failure case: one replica persistently slower
    Will others wait for it? Will it ever catch up?
    Section 5 hints that it will, by slowing down the other replicas

Why Paxos per operation?
  The paper claims low-delay handling of failures
    They're worried about a slow/broken BigTable at one replica
  Does Paxos-per-operation help? Yes -- maybe not even a timeout
    Can finish commit as soon as a majority respond to accept
    Only cost: talk to the coordinator
    Application Servers don't have to know there's a problem
    Primary/backup would force the primary to wait for a slow backup
    They claim p/b has slow failover (2.1.1), but don't explain
  Another possible reason for Paxos-per-operation:
    Compatible w/ the BigTable-centric design
    No dedicated servers, as in Harp
    Application Servers directly r/w state in BigTable
    BigTable knows nothing about transactions &c

Does primary/backup have any advantages?
  Might work better after a total replica failure
    E.g. can't reach its coordinator
    Megastore has to re-discover the working majority on every commit
    Primary/backup remembers this from op to op
      (by the current choice of primary and live backups)
  P/B might also avoid Megastore's notification of the coordinator
    after every commit that didn't get the full set of accepts

Performance?
  Section 4.10, Figure 10
  Reads take 10s of milliseconds
  Writes take 100s of milliseconds
  Is that fast or slow?
  How can they support zillions of users at only a few ops/second?
  ...
  Intro says billions of operations per day
    20 billion reads/day / 86,400 sec/day ~= 230,000 reads/second
    3 billion writes/day / 86,400 sec/day ~= 35,000 writes/second

Summary
  Case study of a big production storage system
  Unconventional:
    Wide-area replication AND consistency
    Synchronous replication
    Paxos-per-operation
  Interesting failure model:
    Network is reliable
    Local storage (BigTable) is not very reliable
  Exploits the intended workload:
    Huge # of independent activities
    So individual ops can be slow, but they can still get high throughput
  ...

Mysteries
  How is log space freed up?
  Where are Paxos promises to agree durably stored? In the log?
  What's in a log entry?
  Who executes the Paxos handlers?
  Who applies log entries at remote replicas?
  Who repairs a replica after it's marked "stale" in the coordinator?