Oops, Why Is My Data Gone!? One of the Most Serious Problems in Distributed Database Systems
date
Jun 1, 2024
slug
distributed_system_consistency_en
status
Published
tags
Distributed System
Database
summary
In the very first lecture of MIT's 6.824 Distributed Systems course, Morris emphasizes that unless absolutely necessary, one should never consider making a system distributed, as it complicates many things. However, with increasing demands for data volume, availability, and scalability, distributed systems seem to become the only solution. That brings along numerous challenges: many tasks that are simple on a single machine suddenly become problematic when multiple machines have to collaborate together. Database systems, especially OLTP, are full of these problems because they have zero tolerance for "errors".
type
Tech
lang
en
In the very first lecture of MIT's 6.824 Distributed Systems course, Morris emphasizes that unless absolutely necessary, one should never consider making a system distributed, as it complicates many things. However, with increasing demands for data volume, availability, and scalability, distributed systems seem to become the only solution. That brings along numerous challenges: many tasks that are simple on a single machine suddenly become problematic when multiple machines have to collaborate. Database systems, especially OLTP, are full of these problems because they have zero tolerance for "errors". This article will analyze how database systems like Spanner, Cockroach DB, TiDB, and Calvin solve these issues and guarantee correctness for their customers.
Speaking of "correctness", we should definitely bring up the topic of consistency. So what is consistency, exactly?
- What is Consistency
- Spanner - Elegant!
  - System Architecture
  - True Time API
  - Spanner’s Ideas
- Cockroach DB - Good Enough Is Enough
  - Hybrid Logical Clock
  - Cockroach DB
- TiDB
  - Availability
  - Performance Tuning
- Calvin
  - System Architecture
  - No More Two Phase Commit
- 🧐My Personal Opinion
What is Consistency
There are various and often complex definitions of consistency online. I think the most intuitive explanation is this: regardless of how many machines are working together behind the system, it should look, feel, and function as if it were just a single machine. This means:
All operations have an order, and this order aligns with the time sequence of these operations. For example, if a user first updates A and then updates B, the system must process A first, followed by B.
This seems like a basic requirement for a system, but it's not as simple as it looks in a distributed environment. Imagine a scenario: you update x on server A, then read x on server B within the same system. Can you be sure you'll read the latest value of x? It seems like some extra work might be needed, right? For instance, when A updates x, it might need to replicate the update to as many servers as possible, or when reading x, it might need to read from multiple servers to ensure getting the latest value. You might think, "That's easy too: I'll just write to all machines every time I write, and read from all machines when I read. Wouldn't that solve all problems?" But the problem is that if one machine suddenly crashes, the entire service goes down. Modern distributed systems typically involve thousands of servers, and you never know which server might go offline. This approach would lead to extremely low system availability.
Therefore, ensuring consistency is not an easy task for databases, and many databases (NoSQL) simply abandoned consistency altogether, inventing weaker consistency concepts such as "eventual consistency". Why do they not implement consistency in their systems? Many believe it's due to CAP, since CAP states that in the event of a network partition, one must choose between 100% consistency and 100% availability. However, in my opinion this is inaccurate.
Any distributed system or data store can simultaneously provide only two of three guarantees: consistency, availability, and partition tolerance. (CAP theorem)
CAP is a great theory, but many people have applied it in the wrong direction. Many posts interpret CAP as if C and A were mutually exclusive concepts, and you must guarantee either C or A (because logically, we want distributed systems to keep functioning even during network partitions). So they try to categorize any distributed system as either C or A. But this categorization clearly underestimates the complexity of distributed systems. First, CAP only describes an idealized world where a network partition is the only problem that can occur, and the solution to that problem is to guarantee either 100% availability or 100% consistency. In reality, many more things go wrong in distributed systems, with server crashes and software bugs being dominant. Second, distributed systems have never had 100% availability; no cloud service provider claims its availability is 100%. The most common way to describe availability is the number of nines: is it 99.99% or 99.999%? So even if NoSQL gives up consistency guarantees, it doesn't mean its availability is 100%. Similarly, NewSQL "giving up" availability doesn't mean its availability is 0. For example, Spanner, which many categorize as a CP system, has an availability of 99.999%, while DynamoDB, which many categorize as an AP system, also has an availability of 99.999%. The fundamental reason why CAP can't categorize modern distributed systems, in my opinion, is that the most common problems in cloud services are not network partitions, but server crashes and software-level bugs.
So why did NoSQL databases give up on guaranteeing consistency? I think it's more about performance and cost considerations. For example, to tolerate f server failures, Dynamo (not DynamoDB; DynamoDB later adopted Paxos) only needs at least f+1 replicas (depending on user requirements), while Spanner needs at least 2f+1 (because it uses Paxos).
I’d like to make a small modification on top of the CAP triangle, which I call the CPC Triangle. This concept stands for Consistency, Performance, and Cost. Much like the famous CAP theorem, the CPC Triangle suggests that when designing distributed systems, you can only optimize for two of these three aspects at any given time (unproven, just my thoughts).
- Consistency (C): Ensuring all nodes in the system have the same data at the same time.
- Performance (P): The speed and efficiency of the system.
- Cost (C): The cost implications of the chosen architecture.
For example, if you choose consistency and performance, you have to sacrifice some cost (like Spanner). Alternatively, if you choose performance and cost, you have to sacrifice consistency (like Dynamo). Or, if you choose cost and consistency, you might need to give up some performance (like Chain Replication).
Understanding these concepts makes it easier to see why NoSQL sacrifices consistency: it provides availability at a lower cost. On the other hand, NewSQL is willing to spend more money to achieve greater availability while still ensuring consistency. This is a challenging goal, and various database vendors have come up with different and interesting solutions, which are worth discussing.
Spanner - Elegant!
In simple words, Google aims to create a consistent, globally distributed SQL database with two requirements:
- Maximize its scalability and availability as much as possible
- Ensure optimal read performance to the greatest extent possible
Google's approach is extremely elegant. It leverages its unique cloud hardware advantage to guarantee high availability of the system with minimal performance cost.
System Architecture
As shown in the figure, Spanner's architecture is very clear. Tablets are responsible for storing data in a key-value format and persisting it in Colossus (Google's internal file system). To maximize read performance, Spanner uses MVCC+2PL (multi-version concurrency control with two-phase locking), so that read-only transactions do not conflict with other transactions. Specifically, every key inserted carries a system-generated timestamp indicating the commit time of that data.
All newly written data will be replicated to all machines through the leader using Paxos protocol. This approach has several benefits: first, it ensures consistency. Second, it also improves system availability, because as long as more than half of the servers are running normally, Spanner will continue to operate. Third, these servers are also distributed across different data centers, so even if one data center is down, it won't cause the entire database to crash.
The transaction management logic is implemented on the Paxos leader node.
The logic of a transaction works as follows: when a transaction request reaches the Paxos leader, the leader assigns it a timestamp representing the current time. All data written by this transaction will carry this timestamp and be replicated to all machines. At the same time, when reading data, it can only read the "latest" data with a timestamp less than or equal to its own. Let's take a simple example, assuming we have the following database:
| Key@Timestamp | Value |
| --- | --- |
| a@09:00 | 1 |
| a@10:00 | 2 |
| a@12:00 | 3 |
| b@11:00 | 100 |
- Tx1@15:00-02-15-2024
  - read `a`
  - write `b = a + 1`
- Tx2@15:01-02-15-2024
  - write `a = 4`
Suppose Spanner is processing Tx1 and Tx2 simultaneously:
- Because Tx1's timestamp is 15:00, it can only read `a@12:00 = 3`, and finally writes `b@15:00 = 4`.
- Meanwhile, Tx2's timestamp is 15:01, so it writes `a@15:01 = 4`.

The database then becomes:

| Key@Timestamp | Value |
| --- | --- |
| a@09:00 | 1 |
| a@10:00 | 2 |
| a@12:00 | 3 |
| a@15:01 | 4 |
| b@11:00 | 100 |
| b@15:00 | 4 |
From this example, we can see that reads and writes do not conflict at all, and all transactions are ordered and follow the sequence in which they occurred (Tx1 then Tx2).
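To make this read rule concrete, here's a minimal sketch of an MVCC snapshot read (my own simplification, not Spanner's actual code): each key keeps all of its committed versions ordered by commit timestamp, and a reader takes the newest version at or below its own timestamp. Timestamps are plain integers here instead of real clock values.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Each logical key maps to all of its committed versions, ordered by timestamp.
using VersionChain = std::map<uint64_t, std::string>;   // commit_ts -> value
using VersionedStore = std::map<std::string, VersionChain>;

// Snapshot read: return the newest version with commit_ts <= read_ts, if any.
std::optional<std::string> SnapshotRead(const VersionedStore& store,
                                        const std::string& key,
                                        uint64_t read_ts) {
  auto it = store.find(key);
  if (it == store.end()) return std::nullopt;
  const VersionChain& versions = it->second;
  // upper_bound finds the first version with commit_ts > read_ts;
  // the element right before it (if any) is the visible version.
  auto v = versions.upper_bound(read_ts);
  if (v == versions.begin()) return std::nullopt;  // nothing committed at or before read_ts
  --v;
  return v->second;
}
```

Applied to the table above, a read at timestamp 15:00 lands on a@12:00 = 3, while a@15:01 = 4 stays invisible to Tx1, exactly as in the example.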
A simple, highly available, strongly consistent database has been built. It seems quite straightforward, but this architecture is not sufficient for Google's use cases because it cannot support requests at massive scale. As a solution, Google adopted sharding for horizontal scaling: they replicated the above architecture dozens of times, with each replica responsible for only a small portion of all the data. If a transaction involves updating multiple shards, it ultimately uses 2PC (two-phase commit) to commit.
This sharding mechanism greatly improves the system's request processing speed and throughput, but it brings a serious consistency issue. Suppose I have a transaction that updates three partitions (A, B, C) together. The timestamp for this transaction is then chosen after negotiation among the three shards. Generally speaking (and this is Spanner's approach), each of the three partitions generates a timestamp, and then the latest one is selected. The timestamp generated by each partition will differ slightly from the real-world time because different machines' system clocks have offsets (clock skew). Specifically, suppose the real-world time is 15:01: Partition A's system time might be 15:00, while Partition B's might be 15:02. We have no way to make all machines in a distributed system generate exactly the same timestamp, and these few tens of milliseconds of deviation will completely destroy the consistency guarantees we previously architected.
Suppose I have two transactions going on:
- Tx1@09:02 (real world time 09:00):
  - read `a` (Partition A)
  - read `b` (Partition B)
  - read `c` (Partition C)
  - … (other operations)
  - write `c = c + a + b` (Partition C)
- Tx2@09:01 (real world time 09:04):
  - read `d` (Partition D)
  - read `e` (Partition E)
  - read `c` (Partition C)
Looking at the real-world timestamps, Tx1 was executed before Tx2, but according to the timestamps given by the database, Tx1's timestamp is later than Tx2's timestamp. This will result in Tx2 being unable to read the data written by Tx1, because Tx2 can only read data with timestamps earlier than or equal to its own. This is a serious data inconsistency, as users may not be able to read previously written data, creating an illusion of "data loss" for users.
Let's analyze why the problem happens:
- The actual time when a transaction occurs and the timestamp the database generates are different. The timestamp acquired by the transaction is negotiated among multiple partitions, and the chosen timestamp has an offset from the real time. If one transaction's timestamp shifts forward and another transaction's timestamp shifts backward, it may cause inconsistency.
- Two different transactions only update partially overlapping partitions. If two transactions update exactly the same partitions, the consistency problem would be as easy to solve as if there were no partitions. However, if they only update partially overlapping partitions (ABC, CDE), Tx1's timestamp might be determined by Partition A, while Tx2's timestamp might be determined by Partition D, which would lead to inconsistencies.
So the problem can be simply summarized as: how to determine the order of transaction occurrence in a distributed system. You might think this is easy, that using Vector Clock would suffice, but Spanner didn't adopt such a method. The paper didn't give specific reasons, but I believe it's because Vector Clock doesn't have high scalability, as the length of the vector grows proportionally with the number of nodes in the entire system. Google's solution is to leverage their unique hardware advantage.
True Time API
Before introducing Google's solution, let's first discuss Google's hardware advantage. On the servers in Google's data centers, there is a special system API called the TrueTime API:
```cpp
struct TTinterval {
  Timestamp earliest;
  Timestamp latest;
};

TTinterval curr = TT.now();

{
  // if curr timestamp is definitely passed
  bool res = TT.after(curr);
}
{
  // if curr timestamp is definitely not passed
  bool res = TT.before(curr);
}
```
Whenever we call `TT.now()`, the TrueTime API returns two values, earliest and latest, representing that the real-world time is definitely between these two values. `TT.after(t)` means that the real-world time is definitely later than t, while `TT.before(t)` is just the opposite. I'm not very clear about the specific implementation of TrueTime; the paper mentions using GPS and atomic clocks, which I don't quite understand. But it doesn't matter. We need to focus on how to use this API to solve problems.
Spanner’s Ideas
Spanner still uses the same system architecture and the same methods for adding, deleting, querying, and modifying data, but the timestamp is entirely provided by the TrueTime API. Although this timestamp also has an offset, the offset is bounded: the TrueTime API guarantees that the real-world time will never exceed `TT.now().latest`. Spanner employs a clever trick based on this.
Same example:
- Tx1@09:02 (real world time 09:00):
  - read `a` (Partition A)
  - read `b` (Partition B)
  - read `c` (Partition C)
  - … (other operations)
  - write `c = c + a + b` (Partition C)
- Tx2@09:01 (real world time 09:04):
  - read `d` (Partition D)
  - read `e` (Partition E)
  - read `c` (Partition C)
First, when a transaction commits, all involved partitions call `TT.now()` separately, and the transaction chooses the largest `TT.now().latest` as its timestamp. However, this doesn't solve the time offset problem at all, because the `TT.now().latest` produced by different partitions also has offsets. Suppose Tx1 uses the timestamp generated by partition B, and Tx2 uses the timestamp generated by partition D. In this case, as shown in the figure, Tx2 occurs after Tx1 commits, but Tx2's timestamp is smaller than Tx1's. This results in Tx2 being unable to read the data written by Tx1.
To completely resolve this inconsistency, an additional step is needed during each commit. Spanner calls this commit-wait. Every time a transaction is about to commit, it needs to wait until `TT.after(timestamp) == true` before announcing the commit to the outside world.

From the figure, we can see that Tx1 doesn't commit immediately after it finishes, but waits until `TT.after(TT.now().latest) == true` before committing. If you simply look at the diagrams, you might think there's not much difference. Actually, for users, the difference is significant. In figure 1, from the user's perspective, Tx1 happens first, and after Tx1 commits, Tx2 occurs but can't read Tx1's data, so users have an illusion of data loss. However, in figure 2, from the user's perspective, Tx2 commits first, then Tx1 commits, so it's expected that Tx2 can't read Tx1's data.
Another, more intuitive way to understand this: when a transaction A commits, `TT.after(TT.now().latest) == true`, which means the current partition's time has definitely passed A's commit timestamp, so transactions occurring after A will definitely get a larger timestamp, and thus will certainly be able to read the content written by A. This ensures consistency.

Is the performance overhead of doing this significant? Actually, it's not. Google's TrueTime API guarantees that the deviation of latest is very small, usually only around 10ms, so a transaction waits at most about 10ms, and this wait can overlap with Paxos replication.
This is the core idea of Spanner, using the unique properties of the TrueTime API to elegantly solve consistency issues in distributed systems. I've only briefly discussed the general design concepts and architecture; the original paper contains more details and also discusses many optimizations, such as simplifying the process for read-only transactions. If you are interested, you can refer to the original paper for more details.
Cockroach DB - Good Enough Is Enough
Spanner has significant limitations because it relies heavily on Google's internal hardware: it can only be deployed in Google's data centers. However, many financial users are unwilling to hand over their data to Google and prefer to deploy databases on their own internal servers. Other commercial databases like Cockroach DB and Yugabyte DB took this opportunity. But the first problem they needed to solve was how to ensure consistency efficiently without relying on the TrueTime API.
Hybrid Logical Clock
Time is an illusion. – Albert Einstein
From our analysis above, we can see that the key to solving the consistency problem is to handle the order of events in a distributed system. Before TrueTime, there were many solutions, such as Lamport Clock or Vector Clock. However, these solutions have serious flaws in today's distributed systems: 1. They are all logical clocks, unable to resolve requests related to real-world clocks. 2. For Vector Clock, its space complexity is O(n), which has extremely poor scalability.
In 2014, a group of people proposed the Hybrid Logical Clock, which actually takes into account both logical and real-world clocks, solving the pain points of logical clocks:
- The output format is a timestamp, so the system can effectively resolve requests related to real-world clocks
- The space complexity is O(1), so it has high scalability
At a high level, it is essentially a pure software implementation of the TrueTime API, but the time uncertainty is much larger than TrueTime's. TrueTime typically has an offset of about 10ms, while the Hybrid Logical Clock's can be as high as 500ms. This means Cockroach DB can't afford to do a commit wait like Spanner does, as it could take 500ms, making transaction latency terribly high.
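For intuition, here is a compact sketch of the hybrid logical clock update rules as described in the 2014 paper; the struct layout, millisecond granularity, and method names are my own choices.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hybrid Logical Clock: a physical component (l) plus a small logical
// counter (c). O(1) space, and l always stays close to real time.
struct HLC {
  uint64_t l = 0;  // highest physical time observed so far (ms)
  uint64_t c = 0;  // logical counter to order events within the same millisecond

  static uint64_t PhysicalNow() {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
  }

  // Local event or message send.
  void Tick() {
    const uint64_t pt = PhysicalNow();
    const uint64_t old_l = l;
    l = std::max(old_l, pt);
    c = (l == old_l) ? c + 1 : 0;
  }

  // Message receive: merge with the sender's clock (l_msg, c_msg).
  void Update(uint64_t l_msg, uint64_t c_msg) {
    const uint64_t pt = PhysicalNow();
    const uint64_t old_l = l;
    l = std::max({old_l, l_msg, pt});
    if (l == old_l && l == l_msg)  c = std::max(c, c_msg) + 1;
    else if (l == old_l)           c = c + 1;
    else if (l == l_msg)           c = c_msg + 1;
    else                           c = 0;
  }
};
```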
Cockroach DB
Cockroach DB's system architecture is almost identical to Spanner's, utilizing multiple partitions with each partition replicated using Raft. The difference lies in transaction processing. Due to the potentially high offset of up to 500ms in the Hybrid Logical Clock, Cockroach DB simply abandons Spanner-style commit-wait, and even gives up on the strongest consistency guarantee. Yes, that's correct: Cockroach DB, as a NewSQL database, cannot guarantee the strongest form of consistency (linearizability). They slightly relaxed the definition of consistency, which turned out to be acceptable to their customers. Good enough is enough!
As an example:
- Tx1@09:00:10.55 (maximum possible timestamp generated by HLC: 09:00:11.05-02-20-2024)
  - read `a` (Partition A)
  - read `b` (Partition B)
When recording this transaction's timestamp, Cockroach DB also records the transaction's maximum timestamp, which accounts for the maximum possible clock offset:
MaxTimestamp = CurrTimestamp + MaxTimeOffset;
The read operation is still the same as Spanner's, using MVCC: `a` and `b` will have multiple versions, and Tx1 reads the largest version before its timestamp. However, if Tx1 finds that the latest version of `a` has a timestamp greater than Tx1's timestamp but less than Tx1's maximum timestamp, Cockroach DB is stuck, because it has no way of knowing whether Tx1 should read that latest version of `a` or not. If this sounds a bit complex, let me give an example:

- Tx2@09:00:10.65
  - write `a = 5`

Suppose Tx2 writes `a = 5` at 09:00:10.65. Due to the time offset, we can't know whether Tx1 should read `a = 5` or not, because Tx1's maximum timestamp (accounting for the maximum possible time offset) is 09:00:11.05. So it's possible that Tx1 occurred after Tx2 (Tx1 should read `a = 5`) or that Tx1 occurred before Tx2 (Tx1 should not read `a = 5`); there's no way to determine this.

So Cockroach DB's solution is very straightforward: retry! If Cockroach DB detects any possible consistency conflict, it will retry, ensuring that each transaction can read the values written by previous transactions and won't read outdated values. But you'll notice that if there are two transactions whose data operations don't overlap, Cockroach DB would be unable to detect consistency conflicts and thus unable to determine their order.
- Tx3@09:00:10.55
  - read `a` (Partition A)
- Tx4@09:00:10.65
  - read `b` (Partition B)
In this case, Cockroach DB has no way to determine whether Tx3 or Tx4 occurred first, so it cannot provide the strongest consistency (linearizability).
Linearizability: all operations are ordered, and this order aligns with the real-time order of these operations. For example, if a user updates A and then updates B, the system must process A first, then B. This order holds true for every server within the system.
As it turns out, nobody cares. Cockroach DB's customers include various banks and trading platforms, but they don't care about the strongest consistency. If two transactions read and write completely different sets of data, it does not matter much who went first. The consistency guarantee provided by Cockroach DB is good enough, and good enough is enough.
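To make the retry decision concrete, here is a minimal sketch (my own simplification, not Cockroach DB's source) of how a read classifies a version it encounters against the transaction's timestamp and maximum timestamp.

```cpp
#include <cstdint>

enum class ReadDecision {
  kVisible,    // version committed at or before our timestamp: read it
  kInvisible,  // version committed after our max timestamp: safely ignore it
  kUncertain   // version falls inside the uncertainty window: retry the txn
};

// txn_ts:     the transaction's HLC timestamp
// txn_max_ts: txn_ts + maximum possible clock offset (up to ~500ms)
// value_ts:   commit timestamp of the version encountered during the read
ReadDecision Classify(uint64_t txn_ts, uint64_t txn_max_ts, uint64_t value_ts) {
  if (value_ts <= txn_ts)    return ReadDecision::kVisible;
  if (value_ts > txn_max_ts) return ReadDecision::kInvisible;
  // The value may have been written before or after this transaction in real
  // time; the clocks cannot tell, so the transaction restarts at a higher
  // timestamp and tries again.
  return ReadDecision::kUncertain;
}
```

In the Tx1/Tx2 example, a = 5 at 09:00:10.65 falls between Tx1's timestamp (09:00:10.55) and its maximum timestamp (09:00:11.05), so the result is "uncertain" and Tx1 retries.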
Overall, the core idea of Cockroach DB transactions is retry. I've only briefly discussed the underlying logic. If you are interested, you can go through the source code and check out Cockroach DB's blog posts.
TiDB
TiDB's architecture is also extremely similar to Spanner's, using Raft for underlying data fault tolerance, horizontally scaling by partitioning data, and employing multi-version concurrency control. However, the specific timestamp handling is quite different. TiDB's approach is relatively straightforward and aggressive: it directly adopts a centralized timestamp service. This fundamentally solves the consistency problem: all transactions' timestamps are determined by the centralized service, so there is no clock skew problem, and transactions are linearizable.
However, the downside of this approach is that the centralized service can easily become a bottleneck, limiting the database's scalability and reducing its availability, because every transaction needs to request a timestamp from it. These issues are the main challenges faced by TiDB.
Availability
The first challenge faced by a centralized service is the single point of failure. If the timestamp service crashes, essentially no transactions can run on TiDB. Therefore, there must be multiple machines behind this service for fault tolerance. TiDB's central service, PD (Placement Driver), uses etcd as the key-value store behind it, and etcd relies on Raft for its high availability guarantees. So PD is essentially a Raft group providing the timestamp service.
What happens if the PD leader crashes? The PD servers will elect a new leader through Raft to continue providing service. However, there's an issue here: PD must ensure that all generated timestamps are monotonically increasing (otherwise consistency problems will arise). What should the new leader do? First, the new leader certainly can't directly use its local time as the timestamp. As we've repeatedly mentioned above, timestamps are skewed, and if the new leader's local time is earlier than the previous leader's time, the consistency guarantees will be broken. So we need to know the last timestamp generated by the previous leader, and then have the new leader generate timestamps larger than that. The solution is that whenever the PD leader generates a timestamp, it replicates it to all machines through Raft. This way, the new leader can read the latest timestamp before allocating new ones, which guarantees timestamps only move forward.
Performance Tuning
With correctness and availability resolved, performance is the next issue to address. If you analyze it a bit, you'll see that the above architecture has these performance bottlenecks:
- The leader needs to replicate the generated timestamp for every transaction, which is too time-consuming.
- Every transaction needs to get a timestamp from PD, which severely impacts throughput.
TiDB's solution is very clever. First, the PD leader generates a large batch of timestamps at once and gradually allocates them to transactions, so only the highest timestamp needs to be written to the Raft group for replication. A single round of Raft replication can thus serve many transactions.
Second, to improve the overall system throughput, TiDB batches multiple transactions together when requesting timestamps from PD. This improves system throughput at the cost of a bit of latency overhead.
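Here is a rough sketch of these two ideas combined, with my own naming; `PersistHighWatermark` stands in for a Raft proposal. The leader reserves a window of timestamps by replicating only the window's upper bound, then serves allocations from memory, and each allocation can cover a whole batch of transactions.

```cpp
#include <cstdint>
#include <mutex>

// Minimal sketch of a centralized timestamp oracle. Only the window's upper
// bound is replicated; individual timestamps are handed out from memory.
class TimestampOracle {
 public:
  // A newly elected leader starts from the last persisted high watermark,
  // so timestamps never move backwards across failovers.
  explicit TimestampOracle(uint64_t recovered_high_watermark)
      : next_(recovered_high_watermark), limit_(recovered_high_watermark) {}

  // Hand out `count` consecutive timestamps (one batched client request).
  uint64_t AllocateBatch(uint64_t count) {
    std::lock_guard<std::mutex> lock(mu_);
    if (next_ + count > limit_) {
      // Reserve a new window; a real system would do this ahead of time.
      limit_ = next_ + count + kWindowSize;
      PersistHighWatermark(limit_);  // the only write that goes through Raft
    }
    const uint64_t first = next_;
    next_ += count;
    return first;  // caller owns [first, first + count)
  }

 private:
  static constexpr uint64_t kWindowSize = 1 << 20;
  void PersistHighWatermark(uint64_t /*high_watermark*/) {
    // In a real system: propose through Raft and wait for it to commit.
  }

  std::mutex mu_;
  uint64_t next_;   // next timestamp to hand out
  uint64_t limit_;  // exclusive upper bound of the persisted window
};
```

After a leader failover, the new leader reads the persisted high watermark and allocates strictly above it, which is exactly the monotonicity argument from the Availability section.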
In summary, using a centralized timestamp service can significantly reduce system design and engineering complexity. Although it brings a series of availability and performance issues, there are many clever optimization techniques to tackle them. Alibaba's OceanBase also adopts this architecture.
Calvin
Calvin is the earliest NewSQL prototype, created by the legendary database group at Yale. Before it, distributed OLTP databases like Volt DB basically had no ability to scale horizontally in an efficient way. Spanner is more widely known because it was the first commercial database to solve this pain point, but in fact Calvin's paper was published a few months earlier than Spanner's. However, since it was an academic database, its influence was far smaller than Spanner's. Its performance is on the same level as Spanner's, but its system architecture is completely different from Spanner, Cockroach DB, and TiDB.
As a side note: Daniel Abadi, one of the authors of this paper, proves that one can be successful in both academia and industry. He is one of the database legends!
System Architecture
Before understanding Calvin's architecture, I need to first introduce how Calvin solves consistency. Unlike the three systems discussed above, Calvin doesn't use timestamps at all to solve consistency. Its solution is very aggressive: a dedicated system layer sorts all transactions that arrive within a period of time (usually around 10ms), and only after sorting does the execution layer execute them, which prevents any inconsistencies. This ordering also provides additional guarantees, so Calvin can avoid spending time on 2PC (two-phase commit) across multiple partitions.
Avoiding 2PC across multiple partitions entirely is quite important to Calvin's design.
First, unlike Spanner, Calvin doesn't use Paxos for replication and then scale through partitioning, but instead partitions first, and then uses Paxos for replication.
Calvin's system is divided into three layers:
- Sequencer: sorts all transaction requests within a period of time. The sequencer exists on every server and is replicated using Paxos.
- Scheduler: executes these transactions according to their order.
- Storage: stores data to disk. Any key-value store with CRUD capabilities will do, such as RocksDB.
It looks similar to Spanner, so why does Spanner need to do 2PC while Calvin doesn't?
No More Two Phase Commit
Why do most distributed databases need 2PC? Because a transaction is likely to be executed by several partitions together, and it's possible that most partitions succeed while one partition fails, in which case the entire transaction should be aborted and the partitions that succeeded need to roll back. 2PC ensures that when a transaction can commit, all partitions commit together, and when a transaction fails, all partitions abort together.
But 2PC performance is terrible. As its name suggests, it needs to send requests to all participating machines twice, which is a significant overhead. Moreover, if execution is slow on one partition, 2PC will assume the partition has failed, causing the entire transaction to be aborted. Many distributed database vendors have tried various ways to optimize this; for example, Cockroach DB uses parallel commits to reduce the overhead of 2PC. However, if 2PC could be avoided completely, it would be another level of performance improvement for the entire system.
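For reference, the two rounds look roughly like this; a textbook-style sketch rather than any particular system's protocol.

```cpp
#include <vector>

// Each participant partition answers the prepare round before anyone commits.
struct Participant {
  bool Prepare() { /* write a prepare record, hold locks */ return true; }
  void Commit()  { /* make the writes durable and visible */ }
  void Abort()   { /* roll back local writes */ }
};

// Round 1: ask every partition whether it can commit.
// Round 2: broadcast the unanimous decision. Two network round trips per
// transaction, and a single slow or failed partition aborts everyone.
bool TwoPhaseCommit(std::vector<Participant>& participants) {
  bool all_prepared = true;
  for (auto& p : participants) all_prepared = all_prepared && p.Prepare();
  for (auto& p : participants) {
    if (all_prepared) p.Commit();
    else              p.Abort();
  }
  return all_prepared;
}
```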
Calvin's system design completely avoids using "communication" to determine whether each partition can commit. Intuitively, in Calvin, each partition executes transactions in a deterministic way, succeeding or failing together, so there's no need to use 2PC to determine whether to abort. To achieve this, Calvin needs to make aborts deterministic. Generally speaking, there are the following reasons why a transaction might abort:
- The user explicitly specifies in the transaction that it needs to abort
- The SQL code has bugs/fails the constraint checks
- Multiple transactions conflict and need to abort
- If the transaction involves multiple partitions, but some service nodes crash or get stuck
The first and second reasons are hardcoded in the user's transaction logic, so they're already deterministic. Calvin only needs to have each node execute the same transaction logic: if there's an abort, all nodes abort together; if it's a commit, all nodes commit together.
The third and fourth situations are nontrivial because transaction conflicts and service node failures are essentially unpredictable. So Calvin came up with some clever tricks:
For transaction conflicts, Calvin's approach is to avoid them fundamentally. Its transaction management has the following characteristics:
- All partitions must process transactions in the same order
Calvin has a dedicated sequencing layer to sort all transaction requests within a period of time, so partitions just need to process transactions according to this order
- Concurrency control must be deterministic
First, a transaction must acquire all necessary locks before execution. Before executing a transaction, Calvin analyzes which data the transaction will read and write, then acquires all locks in advance. The locks are only released after execution is complete.
Second, the order of lock acquisition is completely consistent with the transaction ordering. Suppose Tx1 and Tx2 want to acquire the same lock and Tx1 is ordered before Tx2; then Tx1 will definitely acquire this lock first, and only afterwards can Tx2 acquire it.
Calvin's transaction management is single-threaded, scanning transaction lists from front to back and giving locks to transactions that need them one by one. If there's a lock conflict, it puts the transaction in a waiting list (the order in the waiting list also follows the original transaction ordering), and gives the lock when it's released. We can clearly see that this approach prevents deadlocks, and also ensures the execution order of transactions is deterministic. Each node will definitely execute these transactions in the same order. This completely avoids aborts due to transaction conflicts.
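Here is a minimal sketch of that single-threaded lock pass, with my own simplified data structures (exclusive locks only): transactions request their entire read/write set in sequencer order, so each per-key queue is served in the same global order, which is why deadlock cannot occur.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified deterministic lock manager: a single thread walks the ordered
// batch from the sequencer and requests every lock a transaction needs up front.
struct Txn {
  uint64_t id;                    // position in the global order
  std::vector<std::string> keys;  // read/write set, known before execution
};

class DeterministicLockManager {
 public:
  // Called once per transaction, strictly in sequencer order. Returns true if
  // all locks were granted and the transaction can execute immediately;
  // otherwise it waits in FIFO queues and runs when earlier transactions release.
  bool Request(const Txn& txn) {
    bool granted_all = true;
    for (const auto& key : txn.keys) {
      auto& queue = lock_queues_[key];
      queue.push_back(txn.id);               // FIFO per key == global txn order
      if (queue.front() != txn.id) granted_all = false;
    }
    return granted_all;
  }

  // Called after the transaction finishes (commit or deterministic abort).
  void Release(const Txn& txn) {
    for (const auto& key : txn.keys) {
      auto& queue = lock_queues_[key];
      if (!queue.empty() && queue.front() == txn.id) queue.pop_front();
    }
  }

 private:
  std::unordered_map<std::string, std::deque<uint64_t>> lock_queues_;
};
```

Because every node runs this same pass over the same ordered input, they all grant locks, execute, and abort in exactly the same way.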
For server node failures, Calvin's approach is simply to let them fail, while the transactions on other nodes commit or abort as needed. When the failed node recovers, it simply re-executes all transactions. The key point is that because all transaction execution is deterministic, the final state of the recovered node will remain consistent with the other healthy nodes: if a transaction aborted on other nodes, it will also abort when this node re-executes it, and vice versa.
Through these clever tricks, Calvin can ensure ACID for distributed transactions without needing 2PC. Calvin is not only successful academically, but also very successful in industry: Fauna DB is a NewSQL database built on Calvin's design, and it has been operating successfully for several years.
🧐My Personal Opinion
All the systems discussed above are very successful, which indicates that each has done its best under different trade-offs. Therefore, it's pointless to analyze them simply in terms of "good" or "bad". I think it's more meaningful to analyze why these systems adopt different approaches to ensure consistency.
In the field of consistency, NewSQL can be roughly divided into two major "parties": Spanner and Calvin. Let's first talk about Spanner. I think Spanner's approach is certainly the most intuitive, because before it, MVCC had already become mainstream in concurrency control. Naturally, when facing consistency issues, people would wonder whether they could also get creative with timestamps. Thus, Spanner was born. Spanner is a very successful commercial database with great influence: its paper has been cited over 2000 times and is a must-read in the NewSQL field. However, Spanner has a fatal problem: it relies too much on Google's internal data center hardware, making deployment very inflexible and losing many potential customers who care about data security (mainly in the financial industry). This left opportunities for other NewSQL databases.
Cockroach DB is currently one of the leading vendors in NewSQL. It cleverly uses a "retry" mechanism to solve the consistency problem, making it completely independent of any special hardware, so deployment becomes flexible and convenient. But Cockroach DB also has a fatal problem: it does not provide the strongest consistency. One of Calvin's authors, Daniel Abadi, criticized Cockroach DB and Yugabyte DB in his blog, saying they are not truly consistent databases and are trying to fool customers. When arguing with Daniel Abadi, the CTO of Yugabyte DB didn't directly answer any questions about consistency; instead, he talked a lot about the database's serializability, which is a completely different concept. I believe that the consistency model of Cockroach DB and Yugabyte DB is good enough for most clients, but the lack of the strongest consistency guarantee more or less affects sales. It requires a lot of effort to explain to customers what's really going on, and whether this "not perfect" consistency model will affect their business.
TiDB and OceanBase cleverly avoid the problem completely by directly using a centralized timestamp service to provide the strongest consistency guarantee. They don't need to spend effort teaching clients about "distributed" subtleties the way Cockroach DB does. And it's not complicated in engineering terms, which avoids all sorts of strange bugs. I attended a lecture by Charlie Yang, the technical lead of OceanBase, where he specifically pointed out that Cockroach DB hasn't achieved consistency. I guess when TiDB and OceanBase sell their services, they also intentionally or unintentionally mention that competitors like Cockroach DB and Yugabyte DB lack consistency, while their databases can provide the strongest correctness guarantee.
Calvin is a classic, academically highly successful database. However, apart from Fauna DB adopting its architecture, Calvin is not particularly popular among today's commercial NewSQL databases. Why do most database system designs prefer Spanner's architecture over Calvin's? I think there are two reasons:
First, as a successful commercial database, Spanner has already proven to be feasible in industry at large scale. New database vendors don't need to “experiment” when following Spanner’s architecture.
Second, when pitching to investors and customers, they only need to say "We are very similar to Spanner, but we're open-source, more flexible to deploy, and have xxx advantages," and investors and customers will immediately understand. Spanner is made in Google, so following Google can't be wrong.
Adopting Calvin's architecture might run into various engineering problems, and it would require more time to explain to customers and investors what exactly they're doing, what's brilliant about their architectural design, and what differentiates them from Spanner. I think that also explains why commercializing academic databases is so hard: industry system designers tend to be extremely conservative when building systems. Adopting ideas that are only academically successful carries a lot of risk. Designers ultimately focus on solving their customers' problems, and they will choose the architecture that has proven to be successful and reliable at scale in industry.
This is the end of this article. I've referenced many papers and blogs, with links attached below. If you find any errors, or if you have any thoughts, feel free to discuss them with me.
[1] J. Corbett et al., “Spanner: Google’s Globally Distributed Database,” ACM Trans. Comput. Syst, vol. 31, no. 8, 2013, doi: https://doi.org/10.1145/2491245.
[2] Murat Demirbas, M. Leone, Bharadwaj Avva, Deepak Madeppa, and S. Kulkarni, “Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases,” Jan. 2014.
[3] https://zhuanlan.zhihu.com/p/462398795
[4] S. Kimball and I. Sharif, “Living without atomic clocks: Where CockroachDB and Spanner diverge,” Cockroach Labs, Jan. 27, 2022. https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
[5] D. Abadi, “DBMS Musings: NewSQL database systems are failing to guarantee consistency, and I blame Spanner,” DBMS Musings, Sep. 21, 2018. https://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html (accessed Mar. 19, 2024).
[6] D. Eeden, “TimeStamp Oracle (TSO) in TiDB,” docs.pingcap.com. https://docs.pingcap.com/tidb/stable/tso (accessed Mar. 19, 2024).
[7] Haitao Gao, “TiDB’s Timestamp Oracle - DZone,” dzone.com. https://dzone.com/articles/tidbs-timestamp-oracle (accessed Mar. 19, 2024).
[8] N. VanBenschoten, “Parallel Commits: An atomic commit protocol for globally distributed transactions,” Cockroach Labs, Nov. 07, 2019. https://www.cockroachlabs.com/blog/parallel-commits/ (accessed Mar. 19, 2024).
[9] D. Huang et al., “TiDB: A Raft-based HTAP Database,” Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 3072–3084, Aug. 2020, doi: https://doi.org/10.14778/3415478.3415535.
[10] A. Thomson and D. J. Abadi, “The case for determinism in database systems,” Proceedings of the VLDB Endowment, vol. 3, no. 1–2, pp. 70–80, Sep. 2010, doi: https://doi.org/10.14778/1920841.1920855.