> Is R* still used in its raw form or was it just more inspiration for
> future systems?

R* was an experimental research system; I suspect it was never used
commercially. However, it and the system it was based on (System R)
were very influential; many subsequent commercial systems borrowed
their ideas.

> What kind of current day systems use 2PC or presumed-abort/commit?

Many modern databases support distribution -- sharding the data across
many servers to achieve parallel speedup and thus be able to support
high loads. These databases use 2PC to implement transactions that use
records on multiple shards. If you search the web for "oracle presumed
abort" or "ibm presumed abort" you'll see hints that their commercial
DB products use PA.

> Given that this paper was published in 1986, the systems in question
> are rather dated. Are there newer systems that embrace deadlocking
> databases?

Every database system that supports transactions has to cope with
deadlocks. The problem and the set of possible solution designs haven't
really changed since R*. If you poke around the web you'll find plenty
of material about deadlocks in modern databases such as IBM DB2.

> My question is more of a meta question. I'm wondering why we seem to
> have gone back in time to study this older system? It seems like we've
> been on a trend to understand more and more recent things.

There are ideas that are important today that were invented a long
time ago. Two-phase commit was invented at least 30 years ago, but is
still important in today's distributed systems. Usually the reason we
assign an old paper is that we want to talk about an important idea
that's old, and there are no new papers that do a good job of
describing the idea. Usually new papers assume the reader already
understands old ideas.

> Could it be that it is interesting to see an early version of a
> distributed transactional system?

I find it interesting to read old papers on topics like this, though
they can be confusing when the authors have unstated assumptions that
are hard to reconstruct decades later.

> Why is it so important for the transactions to be atomic?

What "transaction" means is that the steps inside the transaction
occur atomically with respect to failures and other transactions.
Atomic here means "all or none". Transactions are a feature provided
by some storage systems that makes programming easier.

An example of the kind of situation where transactions are helpful is
bank transfers. If the bank wants to transfer $100 from Alice's
account to Bob's account, it would be very awkward if a crash midway
through this left Alice debited by $100 but Bob *not* credited by
$100. So (if your storage system supports transactions) the programmer
can write something like

  BEGIN TRANSACTION
    decrease Alice's balance by 100;
    increase Bob's balance by 100;
  END TRANSACTION

and the transaction system will make sure the transaction is atomic;
either both happen, or neither, even if there's a crash.
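To make the message flow concrete, here is a minimal sketch (in Go) of
what a coordinator might do to run two-phase commit for a transfer
like the one above. The types and helpers (Subordinate, sendPrepare,
sendDecision, waitForAck, forceLog, forget) are invented for
illustration -- they are not the paper's interfaces -- and the message
stubs are assumed to retry until the other side answers.

  package twopc

  type Vote int

  const (
      VoteYes Vote = iota
      VoteNo
  )

  type Subordinate struct{ Addr string }

  func sendPrepare(s Subordinate, txid int) Vote          { /* RPC; retries until answered */ return VoteYes }
  func sendDecision(s Subordinate, txid int, commit bool) { /* RPC: COMMIT or ABORT */ }
  func waitForAck(s Subordinate, txid int)                { /* re-send COMMIT until ACKed */ }
  func forceLog(rec string)                               { /* write to stable storage */ }
  func forget(txid int)                                   { /* delete all record of txid */ }

  // coordinate returns true if the transaction committed.
  func coordinate(txid int, subs []Subordinate) bool {
      // Phase 1: every subordinate must vote YES for the transaction to
      // commit; a NO (or a crash that loses tentative state) aborts it.
      allYes := true
      for _, s := range subs {
          if sendPrepare(s, txid) != VoteYes {
              allYes = false
          }
      }

      if allYes {
          // The commit decision must survive a coordinator crash, so it
          // is force-written before any subordinate is told to commit.
          forceLog("COMMIT")
      }
      // With presumed abort, an abort decision need not be force-logged:
      // a coordinator with no record of txid answers ABORT to inquiries.

      // Phase 2: tell every subordinate the outcome.
      for _, s := range subs {
          sendDecision(s, txid, allYes)
      }
      if allYes {
          // A committed transaction may only be forgotten once every
          // subordinate has acknowledged hearing (and logging) the COMMIT.
          for _, s := range subs {
              waitForAck(s, txid)
          }
      }
      forget(txid)
      return allYes
  }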
> What is the advantage of this kind of consensus algorithm over one
> that uses majority agreement?

Distributed atomic commit causes different computers to do *different*
things (e.g. Alice's bank debits Alice, Bob's bank credits Bob), and
we want them *all* to do their thing, or none of them. Atomic commit
systems are typically not available in the face of failures, since
they require all participating computers to perform their part of the
transaction. Consensus (e.g. Raft) causes a set of computers
(replicas) to all do the *same* thing (so they remain replicas).
Further, we're willing to settle for only a majority, because we want
the system to be available in the face of failures. You should view
two-phase commit and Raft as solving different problems.

> How many subordinates should such a system use to be both reliable and
> effective?

The number of subordinates is determined by how the database's data is
divided among servers, and by how many data records a given
transaction uses. If the load is very high, or the amount of data is
very large, then the database will need to be sharded among many
servers that can execute in parallel. A very big database might shard
its data over dozens or hundreds of servers.

The number of subordinates a given transaction involves is set by the
number of shards from which the transaction uses records. If a
transaction only looks at two database records, then at most the
transaction will use two subordinates. If a transaction uses thousands
of database records, it could involve all the servers as subordinates.

The point of using multiple servers here is not to increase
availability. It is to achieve higher throughput via parallel
execution.

> The paper repeatedly refers to "forgetting" committed and aborted
> messages. It always uses the term in quotes, and I'm not entirely sure
> what it means by that. Does it just mean removing any pending-status
> the transaction had (which would matter during recoveries)?
>
> In context: "force-writes a commit record, sends an acknowledgment
> (ACK) message to the coordinator, and then commits the transaction and
> 'forgets' it."

The paper uses "forget" to mean deleting all record of the transaction
(though any writes to the DB the transaction performed remain). So if
the TC has forgotten about a transaction, and a subordinate asks about
the transaction, the TC won't have any record of the transaction; much
of the paper is about how to proceed correctly in such situations.
Forgetting is important because otherwise the TC and subordinates
would have to remember every transaction, which would be awkward after
years of operation and millions of transactions.

> My Question: Where does the recovery process live? Is there one per
> server? If not, how are messages sent to this recovery process?

Each server has a separate recovery process.

> For the 2PC, is the recovery process always on (the virtual storage
> always being kept updated with stable storage)? If it only starts
> reading logs at the time of failure, is it efficient?

A server's recovery process only runs if something goes wrong (a
timeout or crash+reboot). Recovery can be slow, since it may need to
read state from the disk log, or send queries to other servers. But
the hope is that recovery only needs to run rarely.

> Another question: How would some servers say the transaction is
> read-only and others not?

Each subordinate performs a different part of the transaction; some
parts may involve just reads, others may involve writes too. For
example, the transaction may look like this:

  BEGIN TRANSACTION
    if 6.824 has fewer students than 6.830:
      enroll student in 6.824
    else:
      enroll student in 6.830
  END TRANSACTION

Suppose the 6.824 information is stored on one server, and the 6.830
information on another. If the "if" is true, then the transaction
reads the 6.830 information but doesn't write it. Thus the part of the
transaction done at 6.830's server is read-only, but the part done on
6.824's server is read/write.
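The paper's Section 3 read-only optimization falls out of this
observation: a subordinate that made no updates for a transaction can
tell the coordinator it is read-only, release its locks right away,
and drop out of the second phase. Here is a rough sketch of a
subordinate's reply to PREPARE along those lines; the helpers
(wroteAnything, prepareToCommit, releaseLocks, forceLog) are
hypothetical stand-ins, not names from R*.

  package readonly

  type Vote string

  const (
      VoteRead Vote = "READ" // only reads: skips the second phase entirely
      VoteYes  Vote = "YES"  // prepared: will obey the coordinator's decision
      VoteNo   Vote = "NO"   // cannot commit: the transaction must abort
  )

  func wroteAnything(txid int) bool   { /* did this txn update local data? */ return false }
  func prepareToCommit(txid int) bool { /* tentative updates and locks intact? */ return true }
  func releaseLocks(txid int)         { /* ... */ }
  func forceLog(txid int, rec string) { /* write to stable storage */ }

  func handlePrepare(txid int) Vote {
      if !wroteAnything(txid) {
          // Purely a reader: release read locks now, write nothing to
          // the log, and forget the transaction. Whether txid later
          // commits or aborts makes no difference to this server.
          releaseLocks(txid)
          return VoteRead
      }
      if !prepareToCommit(txid) {
          // e.g. a crash lost the tentative updates and locks, or a
          // constraint (such as a duplicate key) was violated.
          return VoteNo
      }
      // The promise to commit must survive a crash: force-write a
      // PREPARE record, keep the locks, and vote YES.
      forceLog(txid, "PREPARE")
      return VoteYes
  }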
> What sort of systems use the hierarchical 2 phase commit described in
> the paper?

I suspect the reason is to cope with SQL statements (or query plans)
that involve sub-queries. For example, suppose table ttt is sharded
across many servers. If the program says

  BEGIN TRANSACTION
    UPDATE ttt SET x = (SELECT ...)
  END TRANSACTION

then each server's share of the work requires it to run inner SELECTs
to calculate the new column x value for each row. Those SELECTs may
need to be sent to other servers, and those other servers will need to
acquire locks, contribute yes or no when the TC decides whether to
commit, and release locks after the final COMMIT or ABORT. The natural
implementation of this is a hierarchy. A JOIN of two tables can also
require each server to gather data from other servers, and thus might
be a natural fit for hierarchical transactions.

> Q: Why would a subordinate send a NO VOTE to the coordinator on a
> regular run?

This is a good question! The reason we care most about is if the
subordinate crashed and rebooted after it did some of its work for the
transaction but before it received the prepare message; during the
crash it will have lost the record of tentative updates it made and
locks it acquired, so it cannot complete the transaction, and must
vote NO.

Another possibility (depending on how the DB works) is if the
subordinate detected a violated constraint on the data (e.g. the
transaction tried to write a record with a duplicate key in a table
that requires unique keys). Another possibility is that the
subordinate is involved in a deadlock, and must abort to break the
deadlock.

One way to make atomic commit faster is to restrict transactions so
that subordinates cannot say NO in some or all situations. The
read-only transactions in the paper's Section 3 are like this. This
paper identifies more such streamlined situations:

  https://www.usenix.org/system/files/conference/atc12/atc12-final118.pdf

> Are there any cases where you wouldn't want to use presumed abort over
> the initially proposed protocol?

I think Presumed Abort only differs from 2P when the transaction is
partially or wholly read-only, and in that case PA seems strictly more
efficient. So I don't think there's any situation where you'd prefer
2P to PA.

> Who exactly plays the role of coordinator in the distributed
> transactions? The client, a third party "arbiter", or someone else? Is
> it possible to instead have an arbiter issue the commit requests and
> respond to the client?

There are two common arrangements. One possibility is that a server
acts as the TC (transaction coordinator); it's particularly convenient
for one of the servers participating in the transaction to also act as
TC, since that saves a few messages. Another common setup is for the
client to also act as TC, again saving a few messages, though it may
be tricky to ensure that a client acting as TC has the required crash
recovery properties.

> In section 2.2, the paper discusses subordinate failure and recovery
> process, but not for the coordinator.

Section 2.2's "recovery process" runs on both the TC and on the
subordinates. For example, the text in 2.2 "If the recovery process
finds a transaction in the committing state..." applies to the TC (if
the TC got YES votes from all subordinates, logged a commit record,
and then crashed). The paper's text is confusing because it's written
in a way general enough to apply to a hierarchy of servers (root TC,
non-leaf subordinates that partially act as TC, and leaf subordinates).
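A rough sketch of what that per-server recovery process might do after
a reboot, following the presumed-abort rules. TxnRecord and the
helpers (scanLog, resendCommit, ackAndFinish, askCoordinator,
abortLocally, forget) are made-up names for illustration, not the
paper's; the point is that one process handles both the coordinator
and subordinate roles depending on what it finds in the log.

  package recovery

  type TxnRecord struct {
      Txid          int
      Committed     bool // a COMMIT record is in the stable log
      AmCoordinator bool // this server coordinated Txid
  }

  func scanLog() []TxnRecord         { /* read the stable log */ return nil }
  func resendCommit(txid int)        { /* re-send COMMIT until every subordinate ACKs */ }
  func ackAndFinish(txid int)        { /* finish committing locally, send the ACK */ }
  func askCoordinator(txid int) bool { /* retry until it answers; true = committed */ return false }
  func abortLocally(txid int)        { /* undo tentative updates, release locks */ }
  func forget(txid int)              { /* delete all record of txid */ }

  func runRecovery() {
      for _, r := range scanLog() {
          switch {
          case r.Committed && r.AmCoordinator:
              // The decision is already stable: keep telling the
              // subordinates COMMIT until all of them acknowledge it.
              resendCommit(r.Txid)
          case r.Committed:
              // Subordinate that crashed after logging COMMIT: finish
              // the commit and (re-)send the ACK.
              ackAndFinish(r.Txid)
          case !r.AmCoordinator:
              // Subordinate with a PREPARE record but no decision: it
              // is in doubt and must ask the coordinator what happened,
              // retrying for as long as it takes.
              if askCoordinator(r.Txid) {
                  ackAndFinish(r.Txid)
              } else {
                  abortLocally(r.Txid)
              }
          default:
              // Coordinator with no COMMIT record: presume abort. The
              // same rule covers inquiries about transactions this
              // server has no record of at all -- it answers ABORT.
              abortLocally(r.Txid)
          }
          forget(r.Txid)
      }
  }

A server in the middle of a hierarchy may find both kinds of records
in its log, which is why the paper describes recovery in role-neutral
terms.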
> Moreover, it doesn't discuss what happens when a subordinate is trying
> to contact a coordinator during recovery and the coordinator fails.

If the subordinate needs to contact the TC, the subordinate must keep
trying until the TC answers.

> How come the recovery process cannot distinguish the role (coordinator
> versus subordinate) based on the inquiry?

I think the situation the paper has in mind is that a server has
altogether forgotten about the transaction. In a hierarchical
transaction scheme, servers in the middle of the hierarchy act as both
coordinators (for their subordinates) and subordinates (to the
coordinator above them). So it may not be clear in which capacity the
forgetful server is being queried.

> How can a coordinator, upon restart, abort a transaction and not
> inform any of the subordinates (Section 2.2, last paragraph)?

I think the paper means "the TC forgets about the transaction during
the crash, because it loses the contents of RAM." The TC doesn't do
anything active to abort the transaction (since it no longer even
knows the transaction existed), but will reply ABORT to any query
about the transaction.

> I couldn't find any mention about what happens if ACK messages are
> lost in transit between subordinate and coordinator. Does the
> subordinate retry at some point? Does the coordinator retry sending to
> the subordinate at some point?

I believe the TC re-sends the COMMIT message until the subordinate
answers. The TC needs to be sure the subordinate has heard (and
logged) the COMMIT before the TC can forget about the transaction.

> The two-phase commit paper does not mention replicating data on
> individual subordinates, so that a subordinate's failure can be
> withstood by the distributed database system without it voting NO for
> every transaction. How would 2PC work on top of a system of
> subordinates running Raft/Paxos, to increase the stability of
> subordinates in the 2PC protocol?

You are right that 2PC doesn't provide availability, since there are
situations where it must block waiting for a crashed server to reboot.
If you want to continue despite any single failure you must layer 2PC
on top of a replication scheme like Raft or Paxos, so that the TC and
each subordinate is replicated. The TC and subordinate must be
implemented in a way that's compatible with Raft/Paxos, which is easy
in principle but (as you'll see in Lab 3) requires following some
rules. You'll get a little experience with this in Lab 4, where
there's a specialized two-phase-commit-like protocol involved in
moving a shard of key/value pairs from one replication group to
another.

> In the discussion of the deadlock detector, the authors state that
> because they do not expect deadlocks to occur frequently, they treat
> every detected deadlock as a true deadlock. Is this a fair assumption
> (in reality how frequently do deadlocks occur) and are there any
> potential dangers to a false positive detected deadlock?

My experience is that deadlocks are rare. Further, I suspect that if
someone has a workload in which they are frequent, they will probably
re-write their transactions to have fewer deadlocks (perhaps by
acquiring locks in the same order in every transaction). The danger of
false positives is that transactions will be needlessly aborted and
have to be re-executed, which wastes time.
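A sketch (not anything from R*, which detects deadlocks rather than
avoiding them) of that lock-ordering trick: if every transaction
acquires the locks it needs in one global order, no cycle of waiters
can form, so deadlock is impossible. The lockTable type here is made
up for illustration.

  package locking

  import (
      "sort"
      "sync"
  )

  type lockTable struct {
      mu    sync.Mutex
      locks map[string]*sync.Mutex
  }

  func newLockTable() *lockTable {
      return &lockTable{locks: make(map[string]*sync.Mutex)}
  }

  func (lt *lockTable) lockFor(key string) *sync.Mutex {
      lt.mu.Lock()
      defer lt.mu.Unlock()
      if lt.locks[key] == nil {
          lt.locks[key] = new(sync.Mutex)
      }
      return lt.locks[key]
  }

  // acquireAll locks every record a transaction will touch, always in
  // sorted-key order. Two transactions that both need "x" and "y" will
  // both try "x" first, so one simply waits for the other -- rather
  // than each holding one lock while waiting for the other's, which is
  // the cycle that creates a deadlock.
  func (lt *lockTable) acquireAll(keys []string) {
      sorted := append([]string(nil), keys...)
      sort.Strings(sorted)
      for _, k := range sorted {
          lt.lockFor(k).Lock()
      }
  }

  func (lt *lockTable) releaseAll(keys []string) {
      for _, k := range keys {
          lt.lockFor(k).Unlock()
      }
  }

This trick only works when a transaction knows its whole lock set up
front; R* instead lets transactions acquire locks as they go and
relies on deadlock detection plus victim abort.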
> Also, the paper mentions "victim transaction abort." If deadlocks
> don't occur infrequently, isn't this a big hit to reliability?

Typically the client software re-executes the transaction if it's
aborted.

> Consider the modified example in the lecture notes:
>
>   x and y are bank balances
>   x and y start out as $10
>   T1 is doing a transfer of $1 from x to y
>   T1:
>     add(x, 1)  -- server A
>     add(y, -1) -- server B
>   T2:
>     tmp2 = get(y)
>     tmp1 = get(x)
>     print tmp1, tmp2
>
> The modified part is that the first two instructions in T2 were
> swapped. Is it possible that during the first 2-phase commit, T1
> acquired the lock of x and T2 acquired the lock of y, and during the
> second 2-phase commit, T1 acquired the lock of y and T2 acquired the
> lock of x, and eventually T2 outputs 11, 10?

This example will deadlock rather than produce that output. Once T1
holds the lock on x and T2 holds the lock on y, both will wait forever
at their second lines for the lock the other holds. In R*, the
deadlock detector will eventually abort one or the other.

> What will happen if a subordinate sends an ACK before writing the
> COMMIT record?

This can't happen, because R* writes the COMMIT record before sending
the ACK. Suppose it could happen, and the subordinate crashed after
sending the ACK and before writing the COMMIT. The TC would receive
the ACK and forget about the transaction. On restart, the subordinate
would see the PREPARE record in its log but not a COMMIT record, and
it would ask the TC whether the transaction committed or not. By the
paper's rules for 2P, the TC would reply ABORT because it has no
record of the transaction. This is a serious error, because the
transaction actually committed.

> How do the actual DB modifications/reads actually fit into this?
> Basically, I'm wondering which of the following two subordinates are
> doing: 1) within the PREPARE force written to the log, they write a
> number of REDO log operations, describing the ops the transaction
> should take, then, upon receiving the COMMIT signal, they apply those
> operations, and force write the COMMIT operation with a variety of
> UNDO log operations (or, perhaps no new undo/redo log entries) and ACK
> 2) prior to force writing the PREPARE to the log, they apply the
> operations in the transactions, then force-write a PREPARE with UNDO
> log entries, then send the YES back to the coordinator. Upon receiving
> a commit entry, they write the COMMIT operation as above.
> Or perhaps they are doing something else? I am rather confused
> regarding a number of their implementation details.

A subordinate applies the updates to the DB after it writes the COMMIT
to its log.

> What happens if the subordinate writes a COMMIT, then crashes before
> sending the acknowledgement ACK? What prevents the coordinator from
> having to wait forever for the acknowledgement?

The TC will keep re-sending the COMMIT until the subordinate reboots
and responds with an ACK. If the subordinate never reboots, the TC
must wait forever. The goal of two-phase commit is not to ensure
continuous availability even if some servers are down; instead, it
assumes all servers will eventually be live, and ensures that (once
they are alive) they all arrive at the same commit-vs-abort decision.
If you want both high availability and atomic commit, you can layer
two-phase commit on top of Raft-replicated services (i.e. use Raft to
replicate the TC and each subordinate).
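Putting the two preceding answers together, here is a minimal sketch
of both ends of the ACK exchange. The helper names (forceLog, sendAck,
applyUpdates, forget, sendCommit, recvAck) and the retry interval are
invented for illustration; only the ordering of the steps comes from
the paper.

  package commitphase

  import "time"

  func forceLog(txid int, rec string)     { /* write to stable storage */ }
  func sendAck(txid int)                  { /* tell the coordinator we've logged COMMIT */ }
  func applyUpdates(txid int)             { /* make the writes visible, release locks */ }
  func forget(txid int)                   { /* delete all record of txid */ }
  func sendCommit(sub string, txid int)   { /* ... */ }
  func recvAck(sub string, txid int) bool { /* did an ACK arrive in time? */ return false }

  // Subordinate side: the COMMIT record must reach stable storage
  // before the ACK goes out. If the order were reversed and the
  // subordinate crashed in between, the coordinator (having seen the
  // ACK) would forget the transaction and later answer ABORT to the
  // subordinate's inquiry, even though the transaction committed.
  func handleCommit(txid int) {
      forceLog(txid, "COMMIT")
      sendAck(txid)
      applyUpdates(txid)
      forget(txid)
  }

  // Coordinator side: keep re-sending COMMIT until every subordinate
  // has acknowledged it; only then may the coordinator forget the
  // transaction. A crashed subordinate just delays this loop -- it
  // never changes the outcome, which is already decided.
  func waitForAcks(txid int, subs []string) {
      pending := map[string]bool{}
      for _, s := range subs {
          pending[s] = true
      }
      for len(pending) > 0 {
          for s := range pending {
              sendCommit(s, txid)
              if recvAck(s, txid) {
                  delete(pending, s)
              }
          }
          time.Sleep(time.Second) // arbitrary retry interval
      }
      forget(txid)
  }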
> They seem to treat force-writes as atomic but what happens if there is
> a crash during the force-write? Why can they make the assumption that
> these operations are atomic?

A disk write may not be atomic, for example if you kick a mechanical
drive while it is writing. Then the sector being written may end up
with a mix of old and new data. The disk will detect this problem if
it reads the sector, since the per-sector checksum will fail.
Two-phase commit software can cope with failed disk writes by checking
that the write completed correctly (either the disk tells it, or the
software reads the data back from the disk). If the write completed,
the subordinate can respond to the PREPARE or COMMIT message. If the
write didn't complete, the software should retry the write.

> What happens during the time of coordinator failure? Would the system
> become unavailable?

Yes: there are points in the protocol where, if a TC or subordinate
fails, the other servers must wait (perhaps for a long time) for the
failed server to reboot.

> What hardware is R* implemented on for the purposes of this paper?

I don't know! Probably IBM mainframe hardware from the 1980s.