the two-phase commit protocol

This is a quick (and a year old) write up of how I understand two-phase commits. A project I was once working on at the Data Systems Group at Waterloo involved understanding, benchmarking, and improving a data migration/load balancing system that eschews two-phase commits due to low latency requirements. There were other systems like FaRM that use them. I've also come across the term before and tried to reason about its correctness, so I figured I'd start by spending some time understanding them.

a transaction

A transaction is a sequence of one or more changes to a database that must be treated as one change. Here's a simple example of a database for a store. You probably have a table "money" to maintain how much money you have and a table "inventory" to maintain an inventory of whatever you're selling. So if you sell, say, three apples for $2 each, two things need to happen: you need to update "inventory" to indicate that the number of apples has reduced by three and update "money" to add $6 to your previous total. Both or none these changes need to take place for your database to make sense. If you only credit your account with $6 or only reduce the number of applies, the data integrity of your data is lost. This all-or-none property is called atomicity.

There are some other properties transactions must satisfy to maintain database integrity:

  • When a transaction completes, it must satisfy all the invariants of the database. For example, if an internal data structure like a skip list or B tree is used, all its invariants must hold at the end of a transaction. Or if you're using a relational database in which the tables must satisfy some rules, those rules should be satisfied at the end of a transaction.
  • When a transaction completes, its changes must persist (be visible even when you turn the machine on and off again)
  • If multiple transactions are allowed at once, they must operate independently of one another. The intermediate state of one transaction should not be exposed to one another.

a distributed transaction

A distributed transaction is a transaction in which multiple hosts participate. You can imagine ordering something online: when the server registers your purchase, it debits your bank account and sends a shipping order to UPS or Fedex. Both those changes need to happen. It doesn't make sense for only one of them to take place.

the protocol

Every distributed transaction has a coordinator (which oversees the transaction) and multiple participants. The coordinator directs the participants to complete the transaction in two phases: the prepare phase and the commit phase.

Prepare Phase

To execute a transaction T, the coordinator sends a prepare to commit T message to all participants and waits for an acknowledgement or abort message from all of them. By ack-ing a prepare message, a participant indicates that it is ready to participate in a transaction in a fault tolerant way. It does this by writing the new state of that transaction to a write-ahead log which is persisted. By writing to a write-ahead log, it is possible to

  • revert to an earlier state (in case T is aborted)
  • complete the transaction (even in the case of a crash)

If any participant returns an abort message (if it doesn't recognize T, if the operation is illegal, etc), the coordinator terminates the entire transaction.

Commit Phase

Once the coordinator has received an ack from all participants, it sends them the commit T message. When participants receive a commit message, they perform any data changes that need to be made, persists those changes to memory, release any acquired locks over the data, and reply with a committed T message. After every participant has replied with committed, the coordinator marks T as completed.

  • If the coordinator doesn't hear from a participant within some amount of time, it resends the commit message.
  • If any participant responds with an abort message, the coordinator sends abort messages to all other participants and all participants revert to their previous state.

correctness

To establish correctness, we have to revisit the properties a transaction must satisfy (in bold below):

All or nothing: If one participant completes the transaction, all other participants do

If a participant completes a transaction, then it was previously in the commit phase. It got there because the coordinator sent it the commit message which is only possible after all participants acknowledge a "prepare to commit" message from the coordinator. The significance of this ack is that the participant commits the transaction in a fault tolerant way. Even if the participant shuts down, when it comes back up, it will complete the transaction since it committed it to its write-ahead log in the prepare phase.

Similarly, if a participant aborts, you can argue that all other participants will also abort.

You can also use a similar (but slightly modified) argument to show that if a coordinator completes a transaction, all the participants must have also completed it.

Durablity: If a transaction is completed, it is persisted

A transaction is marked completed at the top-level if and only if all participants have responded with a committed message. Committed messages are sent by a participant only after database changes are persisted. So if a transaction is completed by a coordinator, all participants must have persisted its associated changes.

Consistency: The paper I linked above also discusses transaction termination—all transactions will either complete or abort. No other intermediate state is possible.

notes

  • There's a dependency on the coordinator here. If it fails, I'm not exactly sure what would happen, but it can't be good—loss of crucial state or even deadlock if the coordinator isn't revived? There are fully distributed protocols that ensure full completion of transactions like Paxos (used in services like Google's Chubby and Spanner) or Raft (used in etcd).
  • The two-phase commit protocol is also pretty slow. Firstly, it's a blocking protocol. The coordinator blocks while waiting for all the participants to respond, and the participants block waiting for the coordinator to initiate the commit phase. Secondly, if a participant doesn't respond in the commit phase, the coordinator would have to timeout and resend the commit message and this could potentially happen more than once, depending on config. If a transaction aborts, the coordinator might also retry the entire transaction. Overall, I don't think this would scale to a situation where you have thousands of participants for one coordinator. A three-phase commit solves some of these problems.
  • I didn't want to explicitly talk about it here, but it's worth thinking about why a single-phase commit doesn't provide the correctness that a two-phase commit does.
  • I didn't delve into the isolation property too much: I have only a cursory understanding of it at this point. Databases seem to provide varying degrees of isolation too.
  • Most of what I learnt about correctness, I read in Nested Transactions: An Approach to Reliable Distributed Computing. It's pretty long so I didn't read all of it. Pages 90-96 are most relevant, but reading 70-90 is useful too—I just skimmed that for now.