6.824 2020 Lecture 10: Database logging, quorums, Amazon Aurora

Amazon Aurora, Verbitski et al., SIGMOD 2017.

Why are we reading the Aurora paper?
  successful recent cloud service, solves serious problems for customers
  big payoff for good design (Table 1)
  shows limits of the general-purpose storage abstraction
  many tidbits about what's important in real-world cloud infrastructure

Here's my understanding of the story that led to Aurora.

Amazon EC2 -- cloud computing, aimed at web sites
  [VMM, linux guests, www servers &c]
  [web clients, EC2 www servers, EC2 DB servers w/ local disk]
  you can rent virtual machines -- "instances"
    they run in Amazon's datacenters, on Amazon's physical machines
    virtual local disk stored on a directly attached physical disk
  customers run their www servers and DBs on EC2
  EC2 is great for www servers
    more load -> rent more EC2 instances
    crash -> who cares; stateless, data is in the DB
  EC2 is not ideal for a DB, e.g. MySQL
    limited scaling options -- just read-only DB replicas
    limited fault-tolerance; if the physical machine dies, the disk dies with it
      hence periodic DB backups to Amazon's "S3" bulk storage service

Amazon EBS (Elastic Block Store)
  [diagram of EC2 instance, two EBS servers -- left half of Figure 2]
  block storage that's still available even if an EC2 instance crashes
  looks like a disk to EC2 instances; mount e.g. Linux ext4 on an EBS volume
  implemented as pairs of servers with disk drives,
    chain replication, paxos-based configuration manager
  both replicas are in the same "availability zone" (AZ)
    AZ = machine room or datacenter
    same AZ for speed, since all client writes have to wait for both EBS servers
  for a DB on EC2, much better fault-tolerance than locally-attached storage!
    if the DB's EC2 instance crashes, just re-start the DB on another EC2
      instance and attach it to the same EBS volume
  note: only one EC2 instance can use a given EBS volume at a time
    EBS is not shared storage

DB-on-EBS shortcomings
  lots of data sent over the network -- both the log and dirty data pages
    the data pages are big even if only a few bytes are changed
  not fault-tolerant enough
    two replicas are not enough, but a chain with more would be too slow!
    both replicas are in the same AZ
      vulnerable to fire, flood, power failure, broken Internet connection

The paper assumes you know how a DB works and how it uses storage.
  Here's an overview of a generic transactional DB, on a single machine.
  example bank transaction -- transfer $10 from y's account to x's account
    x = x + 10
    y = y - 10
    locks x and y, released after the commit finishes
  data lives in a b-tree on disk
  cached data pages
  the DB server modifies only cached data pages as the transaction runs,
    and appends update info to the WAL (Write-Ahead Log)
    the paper's "redo log" == MySQL's InnoDB transaction log
  log entries:
    LSID TID key old new
    101  7   x   500 510    (generated by x = x + 10)
    102  7   y   750 740    (generated by y = y - 10)
    103  7   commit         (transaction finished)
  commit: once the WAL is safe on disk,
    release the locks on x and y
    reply to the client
  later, write modified data pages from the cache to disk
    to free log space and allow cache eviction
    crucial to performance to delay this -- write absorption, since pages are big
  crash recovery
    replay "new" values for all committed xactions in the log (redo)
    replay "old" values for uncommitted xactions in the log (undo)
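To make the WAL picture concrete, here is a minimal sketch in Go of the commit
path described above (my own illustration; the types and names are made up,
not MySQL's or Aurora's). The transaction updates only cached pages and
appends redo records; the reply to the client can go out as soon as the commit
record is durable in the log, long before the dirty pages are written back.

  package main

  import (
      "fmt"
      "sync"
  )

  // One redo-log record, as in the LSID/TID/key/old/new table above.
  type LogRec struct {
      LSN    int
      TID    int
      Key    string // "" for a commit record
      Old    int
      New    int
      Commit bool
  }

  type DB struct {
      mu    sync.Mutex
      cache map[string]int // cached data pages (here, one int per key)
      wal   []LogRec       // pretend appending here == writing the log to disk
      next  int            // next LSN to assign
  }

  // transfer moves $10 from y's account to x's account as one transaction.
  func (db *DB) transfer(tid int) {
      db.mu.Lock() // stands in for locking x and y
      defer db.mu.Unlock()

      // Modify only the cached pages, appending a redo record for each update.
      for _, k := range []string{"x", "y"} {
          old := db.cache[k]
          delta := 10
          if k == "y" {
              delta = -10
          }
          db.cache[k] = old + delta
          db.next++
          db.wal = append(db.wal,
              LogRec{LSN: db.next, TID: tid, Key: k, Old: old, New: old + delta})
      }

      // Commit record; once the WAL is safely on disk we can release locks
      // and reply, well before the dirty pages are written back.
      db.next++
      db.wal = append(db.wal, LogRec{LSN: db.next, TID: tid, Commit: true})
      fmt.Println("committed", tid, "WAL:", db.wal)
  }

  func main() {
      db := &DB{cache: map[string]int{"x": 500, "y": 750}}
      db.transfer(7)
  }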
Next idea: Figure 2
  This is Amazon's Multi-AZ RDS database-as-a-service,
    rather than customers running their own DBs on EC2.
  Goal: better fault tolerance via cross-AZ replication,
    implemented as EBS-level mirroring of the DB's EBS volume.
  Every database-server EBS write is sent to the local EBS volume *and* to a
    mirror EBS volume driven by another EC2 instance.
  The mirroring covers the log and all dirty data pages.
  A DB write has to wait for all four EBS replicas to respond.
  Long delays, lots of data; but more fault-tolerant due to the 2nd AZ.

Aurora big picture
  [diagram]
  DB clients (e.g. web servers).
  One database server (in an EC2 instance), dedicated to one customer.
  Six storage servers store six replicas of a volume.
  Each volume is private to its one database server.
  Aurora uses two big ideas:
    Quorum writes, for better fault-tolerance without too much waiting.
    Storage servers understand how to apply the DB's log to data pages,
      so the DB need only send (small) log entries, not (big) dirty pages.
      It sends to many replicas, but not much data.
  Table 1 summarizes the results -- a 35x throughput increase!
    Hard to know whether the hardware resources are equivalent.
    Hard to tell how much of the win comes from quorum writes vs sending
      only the log; maybe mostly due to much less data being sent.
    Apparently network capacity is a primary limiting factor,
      since the design reduces network use but increases CPU and storage use.

What's Aurora's storage fault-tolerance goal?
  be able to write even if one AZ is entirely dead
  be able to read even if one AZ is dead and one more storage server is dead
    no writes in this case,
    but can still serve r/o requests, and create more replicas
    called "AZ+1" -- read even after failure of one AZ plus one more server
  tolerate temporarily slow replicas smoothly (a big deal)
  repair a dead replica very quickly (in case of another failure!)

Failure of the database server is a separate issue.
  A new database server is started on a new EC2 instance.
  Who/what decides when to do this? How is split brain avoided for the DB
    server? The paper doesn't say much about this.
  The new DB server recovers from the log + data on the storage servers.

Quorum read/write technique
  this is an old and much-used idea (Gifford, SOSP 1979)
  the goal: fault-tolerant storage -- read the latest data despite some failures
  usually applied to simple read/write (put/get) storage
  N replicas
  writes go to any W replicas
  reads come from any R replicas
  R + W = N + 1
    [diagram with overlap]
    simplest case: N=3, W=2, R=2
    the overlap ensures the read quorum includes >= 1 server from the
      previous write quorum
  how to decide which copy is the most recent?
    cannot vote on content! only one server in the read quorum might be up to date
    the writer assigns increasing version numbers to its writes
    storage servers must remember the version of each item
    the reader takes the copy with the max version # among the R it receives
  what if a read or write can't assemble a quorum? keep trying.

What is the benefit of quorum read/write storage systems?
  (In contrast to e.g. chain replication.)
  Smooth handling of dead or slow or partitioned storage servers:
    no need to wait, no need to detect failure, no timeout.
    Important for Amazon, for remote AZs and temporarily slow servers.
  No risk of split brain among the storage servers.
  Can adjust R and W to make reads or writes faster (but not both).
  But:
    Servers not in the write quorum must catch up (e.g. Aurora's gossip).
    Can only tolerate min(N-W, N-R) failures.
    The need for version numbers makes it most suitable for a single writer.

Note that Raft uses quorums too:
  the leader can't commit unless a majority of peers receive the entry;
  the R/W overlap guarantees that a new Raft leader sees every committed write.
  Raft is more capable:
    Raft handles complex operations, b/c it orders operations.
    Raft changes leaders automatically, w/o split brain.
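Here is a toy sketch in Go of the basic quorum scheme described above, for a
single key and a single writer (my own illustration, not Aurora's protocol;
all names are made up). The writer tags each write with an increasing version
number; a put succeeds once W replicas acknowledge, and a get collects R
copies and keeps the one with the highest version. In main(), one replica
misses the second write and a different one then dies; the R/W overlap plus
the max-version rule still yields the latest value.

  package main

  import "fmt"

  // One replica's state for a single item: value plus version number.
  type Replica struct {
      version int
      value   string
      up      bool // simulate dead/unreachable replicas
  }

  const (
      N = 3
      W = 2
      R = 2 // R + W = N + 1, so read and write quorums overlap
  )

  type Client struct {
      replicas []*Replica
      nextVer  int // the single writer assigns increasing version numbers
  }

  // put writes (version, value) and succeeds once W replicas have it.
  func (c *Client) put(value string) bool {
      c.nextVer++
      acks := 0
      for _, r := range c.replicas {
          if r.up {
              r.version, r.value = c.nextVer, value
              acks++
          }
      }
      return acks >= W // otherwise: keep trying
  }

  // get reads from R replicas and returns the highest-versioned copy.
  func (c *Client) get() (string, bool) {
      best, bestVer, got := "", -1, 0
      for _, r := range c.replicas {
          if r.up && got < R {
              got++
              if r.version > bestVer {
                  bestVer, best = r.version, r.value
              }
          }
      }
      return best, got >= R
  }

  func main() {
      c := &Client{replicas: []*Replica{{up: true}, {up: true}, {up: true}}}
      c.put("v1")               // reaches all three replicas
      c.replicas[2].up = false
      c.put("v2")               // write quorum = replicas 0 and 1
      c.replicas[2].up = true
      c.replicas[0].up = false
      v, ok := c.get()          // read quorum = replicas 1 and 2; replica 2 is
      fmt.Println(v, ok)        // stale, but max-version picks "v2" from replica 1
  }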
What N, W, R does Aurora need?
  How about N=3, W=2, R=2, with each replica in a different AZ?
    That doesn't fulfill the AZ+1 goal: an AZ failure plus one more dead
      server leaves only one replica.
  Aurora uses N=6, W=4, R=3
    two replicas in each of three AZs
    one AZ failure leaves a write quorum (4) intact
    one extra failure leaves a read quorum (3) intact

What does an Aurora quorum write look like?
  Not a classic quorum write:
    the DB server's writes to the storage servers do *not* modify existing
      data items!
  A write consists of a new log entry --
    for an in-progress transaction, or a commit log record.
  Aurora sends each log record to all six storage servers.
  Transaction commit waits for all of the xaction's log records to be on a
    write quorum, stated as VDL >= the transaction's last log record,
    so every xaction up through this one can be recovered after a crash.
  Commit ==> release locks, reply to the client.

What do the storage servers do with incoming log entries ("writes")?
  Just store them, initially,
    as a list of relevant pending log entries per data page.
  So a storage server's state is:
    an (old) version of each data page, plus
    the list of log records required to bring that page up to date.
    [diagram, data pages for x and y, log entries]
  A storage server applies log records in the background,
    or when needed by a read.

What does an ordinary Aurora read look like?
  Not a quorum read!
  The Aurora DB server needs a data page, due to a cache miss.
    Writes are log entries; reads yield data pages.
  It needs to find a storage server that has all the relevant log entries.
    The database server tracks each storage server's SCL (Segment Complete LSN):
      each storage server has all log entries <= its SCL,
      and reports its SCL back to the database server.
    The DB server reads from any storage server whose SCL >= the highest
      committed LSN.
  That storage server may need to apply some log records to the data page.

When does Aurora use read quorums?
  Only during crash recovery.
  "Crash" means the database server crashed and was re-started on a different
    EC2 instance; the new database server uses the existing six storage servers.
  What has to be done for recovery?
    Nothing much for committed transactions -- they're already on a write quorum.
    What about transactions interrupted by the crash?
      They cannot be finished, b/c the database server has lost its execution
        state.
      The DB finds the highest log record such that the log is complete up to
        that point -- a log record counts as existing if it's on any reachable
        storage server.
        This is the VCL (Volume Complete LSN) of Section 4.1.
      The DB tells the storage servers to delete all log entries after the VCL.
      The DB issues compensating "undo" log entries for transactions that
        started before the VCL but didn't commit.

What if the database is too big for a replica to fit on one storage server?
  Data pages are sharded into 10-GB Protection Groups (PGs).
  Each PG is separately replicated as six "segments" (on six storage servers).
  Different PGs are likely to be on different sets of six storage servers.
  [DB server, two PGs on 12 storage servers]
  How is the *log* stored w.r.t. PGs?
    The DB server sends each PG just the log records relevant to that PG's
      data pages.

How to restore replication if a storage server permanently dies?
  This needs to happen quickly, to minimize exposure to another failure.
  The dead storage server held segments of perhaps 1,000 PGs (10 TB total).
  Copying 10 TB from one server to another would take 10,000 seconds!
  For the various PGs, their *other* replicas are likely on different storage
    servers, and their newly created replicas can also be placed on different
    storage servers.
    [S stores PGs a, b, and c, each w/ replicas on different other servers]
  So each PG can re-replicate fully in parallel:
    each PG can be sent from a different source server to a different
      destination, up to the number of storage servers.
  If there are lots of storage servers, the *whole* job takes only 10 seconds.
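A back-of-the-envelope sketch in Go of why this re-replication is fast (my own
illustration; server names and timings are invented). Every PG that lost a
segment is copied concurrently, each from a different surviving peer to a
different new server, so the elapsed time is roughly one 10-GB segment copy
rather than a thousand copies in series.

  package main

  import (
      "fmt"
      "sync"
      "time"
  )

  // One protection group that lost a segment on the dead server.
  type PG struct {
      id  int
      src string // a surviving replica of this PG (a different server per PG)
      dst string // the new server chosen to hold the re-created segment
  }

  // copySegment stands in for shipping one 10-GB segment from src to dst.
  func copySegment(pg PG) {
      time.Sleep(10 * time.Millisecond) // pretend this is ~10 seconds of copying
      fmt.Printf("PG %d: re-replicated from %s to %s\n", pg.id, pg.src, pg.dst)
  }

  func main() {
      // The dead server held segments of many PGs; each has its own src/dst.
      var pgs []PG
      for i := 0; i < 1000; i++ {
          pgs = append(pgs, PG{id: i,
              src: fmt.Sprintf("server-%d", 2*i),
              dst: fmt.Sprintf("server-%d", 2*i+1)})
      }

      start := time.Now()
      var wg sync.WaitGroup
      for _, pg := range pgs {
          wg.Add(1)
          go func(pg PG) { // all PGs copy concurrently, on different src/dst pairs
              defer wg.Done()
              copySegment(pg)
          }(pg)
      }
      wg.Wait()
      // Elapsed time ~ one segment copy, not 1000 copies in series.
      fmt.Println("total:", time.Since(start))
  }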
What about the read-only replicas ("Replica Instance") in Figure 3?
  Clients can send read-only requests to them,
    to reduce the load on the main database server, which is likely a bottleneck.
    Results may not reflect recently committed updates!
  Read-only replicas read data pages from the storage servers (but don't write),
    and cache those data pages.
  The main DB server sends its log stream to the replicas;
    the replicas use the log to keep their cached pages up to date
    (see the sketch at the end of these notes).
  Problem: read-only replicas do not know what is locked by transactions.
  Solution: replicas ignore or un-do the writes of uncommitted transactions;
    they can tell which those are from commit indications in the log stream.
  Problem: need to prevent a replica from seeing the B-Tree &c in an
    intermediate state.
  Solution: atomic mini-transactions within larger transactions.

What are the take-away lessons?
  How write-ahead logging interacts with storage systems.
  Quorums for smooth handling of slow/failed replicas.
  Cross-layer optimization -- DB and storage servers.
    Yields great performance improvements,
    but the storage isn't general-purpose.
  Tidbits about primary concerns for cloud infrastructure:
    AZ failure, thus the need for multi-AZ replication;
    transient slowness of individual replicas;
    the network is a worse constraint than CPU or storage.
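As referenced in the read-only-replica discussion above, here is a toy sketch
in Go (my own model, not Aurora's code) of a replica consuming the primary's
log stream. One simple way to "ignore" uncommitted writes is to buffer each
transaction's records and apply them to the cached pages only when that
transaction's commit record arrives, so readers never observe uncommitted data.

  package main

  import "fmt"

  // A log record from the primary's stream (same shape as the redo log above).
  type LogRec struct {
      LSN    int
      TID    int
      Key    string
      New    int
      Commit bool
  }

  // Replica is a read-only replica: cached pages plus writes buffered per
  // uncommitted transaction.
  type Replica struct {
      cache   map[string]int   // cached data pages
      pending map[int][]LogRec // TID -> records of not-yet-committed xactions
  }

  // apply consumes one record from the log stream.
  func (r *Replica) apply(rec LogRec) {
      if rec.Commit {
          // Only now do this transaction's writes become visible to readers.
          for _, p := range r.pending[rec.TID] {
              r.cache[p.Key] = p.New
          }
          delete(r.pending, rec.TID)
          return
      }
      // Buffer (i.e. ignore, as far as readers are concerned) the writes of a
      // still-uncommitted transaction.
      r.pending[rec.TID] = append(r.pending[rec.TID], rec)
  }

  func main() {
      r := &Replica{cache: map[string]int{"x": 500, "y": 750},
          pending: map[int][]LogRec{}}
      r.apply(LogRec{LSN: 101, TID: 7, Key: "x", New: 510})
      fmt.Println(r.cache["x"]) // still 500: xaction 7 not committed yet
      r.apply(LogRec{LSN: 102, TID: 7, Key: "y", New: 740})
      r.apply(LogRec{LSN: 103, TID: 7, Commit: true})
      fmt.Println(r.cache["x"], r.cache["y"]) // 510 740
  }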