A “20 years” Walk in the Museum of Dynamo

date
Aug 10, 2024
slug
dynamodb_en
status
Published
tags
Database
Distributed System
summary
Disclaimer: This article is neither about dynamo, nor about museums, and has absolutely nothing to do with the Houston Dynamo FC. Over the past 20 years, DynamoDB has gradually become the cornerstone of Amazon's internal services. Whenever there's a need for a database, everyone thinks of using DynamoDB. Why is that? What exactly makes it so good?
type
Tech
lang
en
Disclaimer: This article is neither about dynamo, nor about museums, and has absolutely nothing to do with the Houston Dynamo FC.
About 20 years ago, when there weren't so many database choices and many companies were still using Postgres, MySQL, Oracle, or SQL Server, a group of Amazon employees faced a dilemma: the availability of single-machine databases was just too damn poor. Amazon was one of the earliest companies to adopt a microservices architecture, so databases were one of the primary channels for communication and coordination between services. When a database crashed, many services would be affected, and you can imagine those on-call engineers being overwhelmed by the flood of sev2 tickets (every Amazonian knows this is a nightmare). Moreover, as an e-commerce platform, Amazon's losses from service outages are measured in seconds. All these troubles stemmed from the database, because single-machine databases, or those with simple primary-backup setups, were too fragile. Amazon's infrastructure was unimaginably large, with network packet loss, network partitions, and server crashes happening all the time. Amazon needed a reliable database that wouldn't be brought down by any single disaster. Besides reliability, scalability was also a consideration, as traditional databases could no longer keep up with Amazon's rapid growth. The new database had to solve these two problems, and DynamoDB turned out to be an undeniably good solution.
Over the past 20 years, DynamoDB has gradually become the cornerstone of Amazon's internal services. Whenever there's a need for a database, everyone thinks of using DynamoDB. This is entirely due to an unwritten rule within Amazon: unless there's a compelling reason to use a SQL database, everyone should use NoSQL, and when people think of NoSQL, they naturally think of DynamoDB. As a result, services of all sizes consider putting their data into DynamoDB whenever they need storage. This raises several questions worth discussing: Why does everyone like to use DynamoDB? What exactly makes it so good? DynamoDB has gone through several stages of change over these 20 years, and by examining these changes, we might find better answers. In fact, when Dynamo first came out, many Amazonians didn't quite know how to use this NoSQL database at all.
 

Dynamo: Availability, Availability, and Availability

In the beginning, everyone simply called this database Dynamo.
As I mentioned earlier, Dynamo's goals were high availability and high scalability, with everything else being secondary: no consistency guarantees, support for only simple operations (get and put), no consideration of isolation, and so on. With this idea in mind, its system architecture becomes easy to understand.

Data Sharding

Dynamo's architecture can be abstracted into a very simple ring.
Due to the need for high scalability, Dynamo uses a hashing ring to evenly distribute data across the nodes. However, normal hashing has a serious problem: if new nodes are added later, all existing data has to be reshuffled, and during this time the system cannot serve new put/get requests, reducing overall availability. Dynamo uses consistent hashing to solve this problem, which looks like the "ring" above. The idea is very simple: data is distributed around the ring through hashing, with each node responsible for only a portion of the ring. When a new node joins, there's no need to redistribute all the data; only one node needs to hand part of its data to the new node. Similarly, if a node needs to leave, there's no need to redistribute everything either. But consistent hashing has a serious issue: when nodes change, data migration is only point-to-point, so if the data volume becomes large, migration time can become very long. For example:
(Figure source: https://stackoverflow.com/questions/69841546/consistent-hashing-why-are-vnodes-a-thing)
If the blue server leaves, all of its data can only be migrated to the red server. With Amazon's massive data volume, migration time would become extremely long, which hurts availability since services cannot be provided during the migration. Dynamo adopted a very classic variant of consistent hashing: virtual nodes. The idea is also simple: rather than having one server responsible for a contiguous chunk of the circle, it's better to randomly scatter its responsibility across various corners of the circle. This way, when the blue server leaves, the data it was responsible for migrates to several different servers at once, which can be done in parallel. Data partitioning also ends up more balanced.
As the figure shows, when the blue nodes leave, their data is redistributed to the remaining servers. (Figure source: https://stackoverflow.com/questions/69841546/consistent-hashing-why-are-vnodes-a-thing)
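To make the idea concrete, here is a minimal Python sketch of a consistent-hashing ring with virtual nodes. This is not Dynamo's actual implementation; the hash function and the number of virtual nodes per server are arbitrary choices for illustration.
```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map an arbitrary string to a point on the ring (0 .. 2^32 - 1).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, vnodes_per_server: int = 100):
        self.vnodes_per_server = vnodes_per_server
        self._ring = []      # sorted list of (point, server) pairs
        self._points = []    # sorted points only, for bisect lookups

    def add_server(self, server: str) -> None:
        # Each server owns many scattered points ("virtual nodes") on the ring,
        # so removing it later spreads its data over all remaining servers.
        for i in range(self.vnodes_per_server):
            point = _hash(f"{server}#vnode{i}")
            idx = bisect.bisect(self._points, point)
            self._points.insert(idx, point)
            self._ring.insert(idx, (point, server))

    def remove_server(self, server: str) -> None:
        pairs = [(p, s) for (p, s) in self._ring if s != server]
        self._ring = pairs
        self._points = [p for (p, _) in pairs]

    def lookup(self, key: str) -> str:
        # A key belongs to the first virtual node clockwise from its hash.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing()
for s in ["A", "B", "C", "D"]:
    ring.add_server(s)
print(ring.lookup("user#42"))   # e.g. "C"
```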

Data Replication

Data sharding only solves scalability; high availability requires replicating data to different servers, so that if one server goes down, it won't bring the entire system down with it. Of course, data can't simply be replicated to random other nodes, as this would slow down data recovery after a crash. Dynamo uses a very flexible quorum: when Dynamo receives a get request, it sends it to N nodes (where N is the number of server nodes within one partition), and once R nodes return results, the get operation is considered successful. Similarly, when Dynamo receives a put request, it sends it to N nodes, and once W nodes return success, the put operation is considered successful. Users can freely set W, R, and N, but W + R must be greater than N to ensure getting the latest value. Suppose a user sets N = 8; if put x=10 only lands on the first 4 nodes, and get x happens to only reach the last four nodes, then the latest value of x cannot be retrieved. Therefore, only by ensuring W + R > N can we guarantee that the W and R sets overlap, and thus always see the latest values.
As the figure shows, values are put to A and replicated to B, C, and D.
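Here is a toy sketch of the quorum rule, assuming every value carries a version number; N, W, and R are illustrative settings, and the replica layout is simplified to a single key range.
```python
import random

# A toy quorum client. N, R and W are illustrative; Dynamo lets users choose
# them freely as long as W + R > N, so a read always overlaps the latest write.
N, W, R = 3, 2, 2
assert W + R > N, "without overlap, a read may miss the latest write"

replicas = [dict() for _ in range(N)]  # N replicas for one key range

def put(key, value, version):
    # Write to replicas until W of them acknowledge (in a real system the
    # remaining replicas would catch up asynchronously).
    acked = 0
    for rep in random.sample(replicas, N):
        rep[key] = (version, value)
        acked += 1
        if acked >= W:
            return True
    return False

def get(key):
    # Read from R replicas and keep the value with the highest version.
    answers = []
    for rep in random.sample(replicas, R):
        if key in rep:
            answers.append(rep[key])
    return max(answers)[1] if answers else None

put("x", 10, version=1)
put("x", 11, version=2)
print(get("x"))  # 11: because W and R overlap, the newest version is always seen
```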
Under this quorum system, as long as R out of N servers are functioning, get requests can be served; if W out of N servers are working, put requests can be served. However, this level of availability was still far from enough for Amazon's use cases. For instance, suppose a worker accidentally damages a cable on the street and causes a network partition, so the client can only communicate with servers A, E, and F, and with no other servers. Suppose the client sets N to 3, W to 2, and R to 2 (a partition has 3 servers, each put operation needs to write to at least two servers, and each read operation needs to read from at least two servers), and needs to write x=1 to server A and replicate it to B. Under the architecture described so far, Dynamo would become completely unavailable during this network partition, because the client has no way to reach B.
 
A situation where the network is partitioned
But obviously this is not what Amazon wanted, as they pursue ultimate availability. So they played a little trick here. If the client can't write to B, it will try to write to C. If it can't reach C, it will try to write to D, and so on, walking around the circle until it finds a healthy server. In the example above, the client will eventually find E, and then write x=1 to E. Of course, E only holds this data temporarily on behalf of B. E will periodically attempt to contact B to make sure the data is handed back as soon as B comes back. Therefore, Dynamo's availability is not limited by the number of machines in the partition; as long as there are enough machines in the entire cluster to complete the operation, it's fine. Amazon calls this mechanism Hinted Handoff, which I personally consider to be one of Dynamo's best designs.
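A minimal sketch of the Hinted Handoff idea might look like the following; the ring order, the set of reachable nodes, and the hint bookkeeping are all made up for illustration.
```python
# Toy illustration of Hinted Handoff: if an intended replica is unreachable,
# walk clockwise around the ring and hand the write (plus a "hint" naming the
# intended owner) to the next healthy node, which returns it once the owner recovers.
ring_order = ["A", "B", "C", "D", "E", "F"]
alive = {"A", "E", "F"}       # B, C, D are unreachable during the partition
hinted = {}                   # temp_holder -> list of (intended_owner, key, value)

def write_with_handoff(intended: str, key: str, value):
    if intended in alive:
        return intended       # normal write path
    start = ring_order.index(intended)
    for step in range(1, len(ring_order)):
        candidate = ring_order[(start + step) % len(ring_order)]
        if candidate in alive:
            # candidate holds the data on behalf of `intended` until it recovers
            hinted.setdefault(candidate, []).append((intended, key, value))
            return candidate
    raise RuntimeError("no healthy node in the whole cluster")

holder = write_with_handoff("B", "x", 1)
print(holder, hinted[holder])   # E [('B', 'x', 1)]
```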
But there's no such thing as a free lunch; flexibility often comes with a price.

Data Versioning

Imagine the following scenario: In a Dynamo system with N = 3, W = 2, R = 2, servers A, B, and C are in one partition.
 
client1 writes x = 2 to A and B.
Then client2 wants to write x = 4 to B and C. It crashes after writing to B but before writing to C. Now things get interesting: after client1's and client2's operations, the values on servers within the same partition are completely different: A thinks x = 2, B thinks x = 4, and C thinks x = 1. The inconsistency itself is just a minor issue; the real problem arises when client3 wants to read the value of x. Client3 will be completely confused: which one should I trust?
And in cases where Hinted Handoff has happened, things become even more complex.
If some servers have crashed, some put requests may be accepted by backup servers from other partitions, such as E. At this point, not only will the data within the same partition differ, but some backup servers may also hold different values. This leaves the system filled with numerous versions of the same value, confusing and difficult to resolve.
Therefore, Dynamo proposes a way to resolve this conflict. A version number is attached to each value as an identifier, so that new and old data can be distinguished. Doesn't this sound similar to the concept in our previous article? It's really about determining event order in distributed systems. However, when Dynamo's paper was published, there weren't as many solutions available (such as Google's TrueTime or the Hybrid Logical Clock), so they used the classic vector clock as the version for the data. A vector clock works like this: whenever a value passes through a server, that server attaches its own logical timestamp, so a value that has traveled through different servers ends up carrying an array of timestamps. By comparing these arrays, we can determine which piece of data is the most recent. For more specific details, you can check out other articles that specifically analyze vector clocks.
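As a rough illustration, assuming each version is represented as a map from server name to counter, comparing two vector clocks could look like this (a sketch, not Dynamo's code):
```python
# Minimal vector-clock comparison, assuming each version is a {server: counter} map.
# "Dominates" means every counter is >= the other's; otherwise the versions conflict.
def dominates(vc_a: dict, vc_b: dict) -> bool:
    return all(vc_a.get(server, 0) >= count for server, count in vc_b.items())

def compare(vc_a: dict, vc_b: dict) -> str:
    if dominates(vc_a, vc_b) and dominates(vc_b, vc_a):
        return "equal"
    if dominates(vc_a, vc_b):
        return "a is newer"
    if dominates(vc_b, vc_a):
        return "b is newer"
    return "conflict"   # concurrent writes: the application must reconcile them

print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))   # a is newer
print(compare({"A": 2},         {"B": 1}))           # conflict
```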
With data versioning, there are many ways to resolve data inconsistencies. Dynamo leaves the choice to the users themselves: users can decide how to resolve conflicts. Generally, Last Write Wins is adopted, which means the data with the most recent timestamp is eventually selected.
Vector clocks have a well-known issue: they don't scale well, as the size of the version array grows with the number of Dynamo servers. Scalability is certainly one of Amazon's considerations, and their solution is quite straightforward: if the length of the vector clock's version array exceeds a certain size, Dynamo simply pops off the oldest entry. This is indeed a bit too straightforward and could certainly cause problems, but the paper states that it hasn't produced any bugs in the actual production environment. So, what can I say 🤷‍♂️? As long as it works.
The flexible reads and writes, combined with multiple data versions, already indicate that Dynamo cannot guarantee strong consistency (for example, after I put a new value, get might still return the old value). Amazon provides a way for users to resolve data conflicts themselves, which means the data will eventually become consistent, although no one can guarantee exactly how long that will take.

Fast Recovery

If a server is offline for too long, its data becomes very outdated, requiring more time to catch up to the latest data. This process can be extremely lengthy because the data needs to be compared piece by piece to find out what is outdated. Again, this recovery speed is not acceptable for Amazon, so they needed an algorithm to quickly identify outdated data. Amazon's solution is the Merkle Tree.
The Merkle Tree and Merkle Chain share a similar design. The Merkle Tree can quickly identify differences through layer-by-layer hashing.
 
If we want to find out which specific pieces of data differ, we can do so by comparing the two trees. If the hash values of the parent nodes are the same, the two sets of data underneath are identical. If the hash values of the parent nodes are different, we recursively compare the two child nodes. This way, Dynamo can quickly locate differences with O(log N) complexity.
At the same time, building this tree is not particularly complex. It can be done by hashing data blocks and building up layer by layer, like stacking Lego.
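Here is a small sketch of that idea: hash the data blocks, stack the layers, and then compare two trees top-down so only differing subtrees are explored. The block layout and hash choice are arbitrary, not what Dynamo actually uses.
```python
import hashlib

# Minimal Merkle-tree sketch: hash the data blocks, then hash pairs of child
# hashes layer by layer ("stacking Lego"), and compare two trees top-down.
def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    # Returns a list of layers, leaves first, root last.
    layer = [h(b) for b in blocks]
    layers = [layer]
    while len(layer) > 1:
        layer = [h(layer[i] + layer[min(i + 1, len(layer) - 1)])
                 for i in range(0, len(layer), 2)]
        layers.append(layer)
    return layers

def diff(tree_a, tree_b, level=None, idx=0, out=None):
    # Recursively descend only where parent hashes differ: O(log N) per difference.
    if out is None:
        out, level = [], len(tree_a) - 1
    if tree_a[level][idx] == tree_b[level][idx]:
        return out
    if level == 0:
        out.append(idx)          # a leaf (data block) that differs
        return out
    for child in (2 * idx, 2 * idx + 1):
        if child < len(tree_a[level - 1]):
            diff(tree_a, tree_b, level - 1, child, out)
    return out

a = [b"k1=1", b"k2=2", b"k3=3", b"k4=4"]
b = [b"k1=1", b"k2=9", b"k3=3", b"k4=4"]
print(diff(build_tree(a), build_tree(b)))   # [1] -> only block 1 needs syncing
```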

🧐Dynamo - My Personal Take

As one of the earliest large-scale distributed projects, Dynamo brought a lot of new concepts. Some distributed theories, such as consistent hashing, RW quorum, vector clock, and Merkle Tree, were rarely applied directly in such large-scale systems. It provided valuable experience for both industry and academia.
At the same time, Dynamo was also very successful. It supported Amazon's enormous business volume with extremely high scalability and availability. In their own words, Dynamo served millions of requests during the busiest seasons without a single second of downtime.
We also found that in most real-world businesses, it's acceptable to sacrifice some "correctness" in exchange for benefits in other aspects. For example, many businesses don't actually need strong data consistency, and vector clocks can be "appropriately" reduced in size without affecting the business.
However, when the Dynamo team tried to promote Dynamo internally to other teams, it turned out to be surprisingly difficult. They found that many people were not very willing to use Dynamo, and instead preferred another AWS product, SimpleDB, a database service almost as old as S3. (SimpleDB can still be purchased on AWS today.) Dynamo's developers were confused, because SimpleDB had many obvious disadvantages compared to Dynamo:
  1. Limited scalability: SimpleDB partitions could only be up to 10GB in size. The application layer needed a lot of complex logic to work around this limitation.
  2. Poor performance: Every data field was forcibly indexed, which led to poor write performance.
In comparison, Dynamo seems so much better, with high scalability, good performance, and great flexibility. Why did people in real life prefer SimpleDB?
Yeah, why?

Problems

Software engineers not only have to write code, but also need to be responsible for the software's maintenance, a fact that some people overlook. When using a service, one must consider not only the difficulty of development but also the cost of deployment and maintenance. Dynamo falls short precisely in this aspect. Dynamo is not a managed service. You can't simply use an AWS account to connect to a Dynamo endpoint and call its API. When Dynamo is delivered, it's a binary (it might even be a bunch of source code that needs to be compiled), and users need to deploy it to machines themselves and fix any issues that arise later. Imagine you're on call in the middle of the night, and a data version problem in Dynamo causes the program to crash. You need to dig into Dynamo's source code to find the root of the problem. How painful is that! So the barrier to using Dynamo is very high; you need to be at least a Dynamo expert. Most teams at Amazon don't have people with the time and patience to learn how Dynamo actually works. Thus, people preferred SimpleDB, a database that can be easily called through its API with just an AWS account and an endpoint, without any need to deploy or maintain anything.
The second problem is that Dynamo requires programmers to set too many parameters themselves, such as the N, W, and R mentioned earlier, as well as how to resolve conflicts. Scaling up and down also has to be done by the programmers themselves. If these parameters are not carefully chosen, performance will not always be what you hoped for.
In fact, from today's perspective, databases are mostly consumed as managed cloud services like SimpleDB, and users rarely spend much time thinking about deployment and maintenance. But around 2008, cloud computing had not yet taken off, and databases were seldom thought of as a cloud service.
Most programmers just want a service, not a binary or a bunch of source code.

DynamoDB: Being a Managed Service on Cloud

A managed service is different from a binary; it's a higher level of abstraction. As a service, DynamoDB hides more details from programmers, which also gives it more freedom to do certain tricks, such as the multi-tenant model.

Multi-tenant Model

Previously, once Dynamo was deployed, it could only be used by a single customer. But when moving to the cloud, multiple users share it, which is known as the multi-tenant model. Dynamo was like having dinner at home, where generally only family members eat. When DynamoDB moved to the cloud and became a service, it became a restaurant where different people eat together. This is certainly different. For instance, in a restaurant, several tables might order the same dish, but the restaurant must serve them on different plates. At the same time, the restaurant must ensure that each customer can only eat the food on their own table, not from others' tables. The chef in the restaurant can also play some tricks, such as cooking a dish for multiple tables at once.
To put it in more programmer-friendly terms: If you purchase a MySQL instance on AWS, AWS won't create a new EC2 instance and put the MySQL binary in it for you. Instead, it's possible that multiple users are sharing a single MySQL instance, and AWS uses various methods to isolate each user, giving you the illusion that you're the only one using MySQL.
So it's not hard to understand why DynamoDB required significant architectural changes. In fact, according to the authors, DynamoDB's architecture is completely different from Dynamo's. The old Dynamo architecture is simply not suited to a multi-tenant model. Remember that the old architecture is a ring, and because of its unique approach to availability, data can end up scattered anywhere on the ring. If it were made multi-tenant, data management would become extremely messy and maintenance would be very difficult.
DynamoDB's new architecture (source: https://www.usenix.org/system/files/atc22-elhemali.pdf)

DynamoDB's Goals

AWS also wants to sell DynamoDB to external users, so DynamoDB must be able to meet the requirements of more users:
  • DynamoDB can scale infinitely: regardless of the size of your business, DynamoDB handles all operations gracefully.
  • DynamoDB's performance must be stable and predictable: All operations come with latency guarantees. Users won't be “surprised” by sudden high latency.
  • DynamoDB must be highly available: Guaranteeing at least 99.99% availability.
  • DynamoDB flexibly adapts to different businesses: DynamoDB supports not only eventually consistent operations but also strongly consistent operations. So it can be used for businesses with low consistency requirements, like social networking, as well as for businesses requiring high consistency, such as distributed locks.

New Architecture

Although DynamoDB's architecture has been redesigned, there's nothing particularly exotic about it. In fact, they've adopted the partition + Paxos (Raft) architecture that has countless success stories in the industry. For a detailed discussion of this architecture, you can refer to my article on Spanner. This architecture balances scalability and availability, and it adapts easily to multi-tenancy. Thanks to Paxos, strong consistency can also be guaranteed, allowing it to serve as many kinds of users as possible. DynamoDB even supports transactions because of the partition + Paxos architecture.
Of course, if the client's business doesn't require strong consistency, read requests can be served directly by Paxos replicas, improving throughput. The availability of read requests also benefits: a read can succeed as long as there's one surviving replica in the Paxos group.
In later sections, we'll discuss how DynamoDB modifies Paxos to maximize availability at minimal cost.
Finally, I'd like to say that partition + Paxos has become the standard answer for most highly scalable distributed databases. DynamoDB also proves that NoSQL can benefit from this architecture, not just limited to NewSQL.

Provisioning DynamoDB Capacity

When we deploy a service ourselves, one question is never trivial: how many resources do I need to meet the TPS my business requires? Although DynamoDB has become a managed service, a similar question still exists for customers: how "big" a DynamoDB do I need for my business? DynamoDB cannot directly expose the number of physical nodes as a parameter for customers to choose, so it needs a way for customers to quantify capacity.
The first thing that comes to mind is to quantify capacity using throughput. However, DynamoDB supports many operations, and different operations cost Amazon different amounts of computation. It's not practical to set a separate throughput for each operation. So DynamoDB's solution is to abstract new units called read capacity units (RCUs) and write capacity units (WCUs), and different operations consume different numbers of units. For example, a strongly consistent read of up to 4KB costs one RCU (an eventually consistent read of the same size costs half a unit), while a write of up to 1KB costs one WCU; larger items consume proportionally more units. This way, customers can easily calculate the size of the DynamoDB table they need.
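As a back-of-the-envelope illustration (using the public RCU/WCU definitions above; the workload numbers are hypothetical), sizing a table could look like this:
```python
import math

# Rough capacity sizing with the public RCU/WCU definitions:
# 1 RCU = one strongly consistent 4 KB read per second (eventually consistent: half),
# 1 WCU = one 1 KB write per second. The workload below is entirely made up.
def read_units(item_kb: float, reads_per_sec: float, eventually_consistent: bool) -> float:
    per_read = math.ceil(item_kb / 4)      # units per single read, rounded up to 4 KB steps
    if eventually_consistent:
        per_read /= 2
    return per_read * reads_per_sec

def write_units(item_kb: float, writes_per_sec: float) -> float:
    return math.ceil(item_kb / 1) * writes_per_sec   # rounded up to 1 KB steps

rcu = read_units(item_kb=6, reads_per_sec=500, eventually_consistent=True)
wcu = write_units(item_kb=2.5, writes_per_sec=100)
print(rcu, wcu)   # 500.0 RCU and 300 WCU for this hypothetical workload
```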
This quantification also helps DynamoDB plan how to store customers’ data. DynamoDB has the following characteristics:
  • Data is partitioned, so capacity is evenly distributed to each partition. For example, if a table is set to 3000 RCU and has 5 partitions, each partition will have 600 RCU.
  • The database uses a multi-tenant model, so a physical node may contain data partitions from different customers.
Through the capacity estimates provided by each customer, DynamoDB can evenly place different customers' partitions on the same storage node without overloading it. But this allocation causes several problems for customers:
  • Hot partitions: A lot of businesses have unbalanced workloads, so it's easy for some partitions to become extremely hot with too many requests while others sit underutilized. Overheated data causes insufficient capacity for that partition, even though overall usage hasn't exceeded the total capacity the user set.
  • Partition capacity dilution: Because DynamoDB splits partitions based on data size, if a partition's data volume grows too large, it triggers automatic partitioning and subdivides that range. Suppose that partition had 600 RCU; after the split, each half only has 300 RCU, so the capacity gets diluted.
The consequence of insufficient capacity is simple: requests are rejected outright. After all, Amazon needs to make money too; if you want more capacity, you have to pay for it. But this gives customers the illusion of service unavailability due to insufficient capacity, even when DynamoDB itself is available. Customers are certainly unhappy, and Amazon is known for its customer obsession, so they must serve the customers well!

Capacity Burst

Amazon's first move was to allow requests to burst beyond capacity for a short period. It's unlikely that every customer will hit their partitions' capacity limits simultaneously, because a physical node holds data partitions from multiple customers, so there are likely free resources on a physical node to absorb sudden bursts of requests to some customers' partitions.
DynamoDB's specific approach is as follows:
They established a Global Admission Control (GAC). This service uses token buckets to track each request; as long as requests don't exceed the overall capacity the customer bought, the GAC won't block them. At the same time, each physical node also keeps such a token bucket to ensure the node itself isn't overloaded. So even when hot partitions occur, DynamoDB can temporarily "borrow" capacity from other partitions to help process requests to the hot ones.
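A toy token bucket conveys the idea; this is only a sketch of the mechanism, not Amazon's GAC implementation, and the capacity and burst numbers are made up.
```python
import time

# Toy token bucket for admission control: tokens refill at the provisioned
# capacity rate, and a request is admitted only if enough tokens remain.
class TokenBucket:
    def __init__(self, capacity_per_sec: float, burst: float):
        self.rate = capacity_per_sec   # e.g. provisioned capacity units per second
        self.burst = burst             # how far above the steady rate short spikes may go
        self.tokens = burst
        self.last = time.monotonic()

    def try_consume(self, units: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False                   # over capacity: the request gets throttled

bucket = TokenBucket(capacity_per_sec=100, burst=300)
print(bucket.try_consume(250))         # True: a short burst is absorbed
print(bucket.try_consume(250))         # False: sustained overuse is rejected
```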

Splitting for Consumption

Capacity bursts only solve temporarily overheated partitions. If some partitions stay overheated at a large scale, for example occupying 80% of the total capacity, then the customer's requests to them will still be rejected. So Amazon's second move was to automatically split partitions based on consumption. If DynamoDB finds that a partition is too hot, it splits it into two smaller partitions, choosing the optimal split key (generally by observing the distribution of requests) so that the capacity of the two small partitions is similar. The two small partitions then land on two different machines, and can better take advantage of those machines for capacity bursts.

Balancing Consumed Capacity

Due to DynamoDB's multi-tenant nature, a single physical node can host partitions from multiple customers, so allocating partitions to physical nodes is also a challenge. The paper doesn't say exactly what allocation algorithm Amazon uses. Since Amazon has all kinds of customer metadata, such as partition capacity and size, theoretically all these factors need to be considered. At the same time, the capacity of a physical node needs to be larger than the sum of all partition capacities on that node, so that it can better handle capacity bursts and provide the best experience for customers. If DynamoDB finds that a node's capacity is almost exhausted, it selects one customer's partition and relocates it to prevent service unavailability.

On-Demand Provisioning

After implementing so many capacity-related features, the DynamoDB team realized that users don't actually need to set a specific capacity at all. They provided an option called on-demand provisioning: the database automatically scales capacity up or down based on the user's request volume. DynamoDB estimates a capacity from the customer's current request volume, and if TPS suddenly increases, it absorbs such bursts through GAC and automatic partition splitting.
If TPS increases consistently, then DynamoDB needs to scale out horizontally. Scaling usually takes a few minutes, so during that time requests may fail or experience delays, but scaling doesn't happen too often. DynamoDB adopts a doubling rule: the capacity it automatically provisions is twice the customer's current usage. As long as customer traffic grows by less than double, no scale-up is triggered.
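My reading of the doubling rule, sketched below under the assumption that DynamoDB keeps a table's capacity ceiling at roughly twice the recently observed peak; the numbers and the function itself are illustrative, not DynamoDB's actual logic.
```python
# Sketch of the "doubling rule": keep the capacity ceiling at twice the observed
# peak, so the slow horizontal scale-up only triggers when traffic more than doubles.
def next_ceiling(current_ceiling: float, observed_peak: float) -> float:
    if observed_peak > current_ceiling:        # the burst exceeded the pre-provisioned headroom
        return 2 * observed_peak               # scale up (takes minutes in reality)
    return max(current_ceiling, 2 * observed_peak)

ceiling = 1000.0
for peak in [300, 450, 480, 1200]:
    ceiling = next_ceiling(ceiling, peak)
    print(peak, "->", ceiling)
# 300 -> 1000.0, 450 -> 1000.0, 480 -> 1000.0, 1200 -> 2400 (only the 2x+ jump rescales)
```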
Customers absolutely love the on-demand provisioning feature. From my own experience using DynamoDB, I never did any calculation of the TPS my DynamoDB table needed at all. Generally, I just choose on-demand provisioning and use DynamoDB out of the box, saving a ton of time.

Data Availability and Persistence

From DynamoDB's perspective, availability remains the most crucial aspect. DynamoDB is sold to external customers, so availability needs to be quantified, and customers need to understand what those availability numbers actually mean.
DynamoDB adopts the industry-standard approach: measuring availability using nines. Depending on the configuration, DynamoDB can guarantee either four nines or five nines (99.99% or 99.999%) of availability, which means that within a month, the database can only be down for a maximum of 4 minutes (99.99%) or 26 seconds (99.999%); otherwise, Amazon would have to pay fines. When money is involved, the topic becomes incredibly serious.

Server Crashing

Theoretically, in Paxos, you can increase availability indefinitely by simply adding more machines. This is obviously a bad approach, because it not only greatly increases costs, but also increases the latency of every write request, as consensus needs to be reached on more machines. This is why each DynamoDB Paxos group has only a small number of machines (I'd guess about five?).
A small Paxos group with five machines might be too fragile. If three machines crash, the entire partition goes down. Recovering a machine takes several minutes, so if such an incident happened twice in a month, Amazon would have to pay fines. And since DynamoDB manages thousands of Paxos groups, such incidents should be considered the norm rather than the exception.
The solution is to recover quickly when such unfortunate events happen. I personally think this is how the industry solves problems: unlike in academia, where people try to solve problems at the root, once and for all, the industry only needs to find ways to minimize a problem's impact.
Let's first consider what services would be affected if two machines in a Paxos group crashed.
  • Eventually consistent read (not affected): Due to its eventual-consistency nature, as long as one machine in the Paxos group is functioning, the read can return successfully.
  • Consistent read (affected): Because it's strongly consistent, completing this request requires a Paxos quorum. With more than half of the machines in the group down, the request becomes unavailable.
  • Write (affected): All writes are strongly consistent, so they're unavailable too, for the same reason as above.
We can see that every strongly consistent operation becomes unavailable. The reason is the inability to form a Paxos quorum when more than half of the machines in the group have crashed.
First, DynamoDB is backed by AWS, which in theory has an unlimited number of virtual machines. If one member of the Paxos group crashes, we can always start a new machine, and after restoring all the previous logs onto it, the Paxos group can resume operations with this new member. DynamoDB does take this approach, but the problem is that recovering a Paxos member is non-trivial:
  • Too many logs: copying a large number of logs to another machine takes a lot of time, and requests cannot be served during the copy, so availability is reduced.
  • The B-tree index must be reconstructed: while replaying the logs, the Paxos member's state-machine index must also be rebuilt before it can serve requests normally.
DynamoDB's solution is also straightforward:
  • Every so often, Paxos members back up all current Paxos logs to S3, then delete them locally. This way, only the most recent logs since the last backup remain on the machines. During recovery, only the latest logs need to be copied, which is very fast.
  • The newly joined machine takes on a role similar to a learner in Paxos, only storing logs without serving any requests, thus eliminating the need to rebuild the B-tree index. This newly added machine is called a log replica.
This way, DynamoDB can recover a degraded Paxos group in a matter of seconds. In fact, according to the paper, they take an even more cautious approach: as soon as one machine in the Paxos group fails, they immediately start a new log replica. This essentially gives customers an "always on" experience.
The newly joined node for fast recovery (source: https://www.usenix.org/system/files/atc22-elhemali.pdf)

Leader Crash Detection

In Paxos, if a member suspects that the leader has crashed, it will initiate a new election, and during the election the service is unavailable. While operating DynamoDB, Amazon discovered that most of these re-elections were false positives: the leader hadn't actually crashed, but due to network latency, members hadn't received the leader's heartbeat. This caused a large amount of unnecessary downtime. DynamoDB slightly modified Paxos: before a member triggers a re-election, it first contacts the other members to check whether the leader has truly crashed. This greatly reduces false elections and increases availability.
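A sketch of that tweak, assuming a follower can simply ask its peers whether they still hear from the leader; the exact protocol DynamoDB uses is not spelled out in the paper, so this is only the general idea.
```python
# Sketch: a follower that missed heartbeats asks its peers whether they still
# reach the leader, and only starts an election if nobody does.
def should_start_election(i_lost_heartbeat: bool, peers_reaching_leader: list) -> bool:
    if not i_lost_heartbeat:
        return False
    if any(peers_reaching_leader):
        return False   # likely a local network blip: back off, avoid a false election
    return True

print(should_start_election(True, [True, False]))    # False: some peer still hears the leader
print(should_start_election(True, [False, False]))   # True: everyone has lost the leader
```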

Quantifying Availability

As mentioned earlier, availability is described in nines, and from the availability figure we can roughly tell how many minutes of unavailability are allowed per month. However, DynamoDB is a massive distributed system with many components; some may go down while others remain available, which makes its availability difficult to quantify.
Amazon computes DynamoDB's availability every five minutes as (successfully completed requests) / (total requests). When availability drops below a critical threshold, a sev2 ticket is cut and DynamoDB engineers intervene. The system also generates per-customer availability reports, so customers can clearly see the quality of the service they have purchased.

Maintenance

Maintaining such a large distributed system is certainly not an easy task. Many of DynamoDB's operational practices come from lessons learned through accidents, which were painful but valuable.

Dependencies

As a highly available service, DynamoDB must be very careful in choosing the services it depends on. For example, DynamoDB depends on AWS IAM and AWS KMS to manage database permissions, but these can in turn affect DynamoDB's availability. DynamoDB needs to keep functioning normally even when these dependencies are unavailable.
Therefore, Amazon avoids putting IAM and KMS on the database's critical path and instead references these dependencies indirectly. DynamoDB periodically caches data from these services, so that even if they become unavailable, DynamoDB can still serve customers using the cached data. Moreover, caching greatly reduces the latency of calling these services. It also reduces the load on the services DynamoDB depends on, because DynamoDB's scale might otherwise overwhelm them. The only issue with caching is that the data may be stale, but DynamoDB has carefully designed how this data is used, and slightly outdated data doesn't significantly impact the actual production environment.
🧐My Personal Take: In the past, when designing my own systems or reviewing other people's, I was always too conservative about depending on other services: if they crashed, or their upstream services crashed, it could lead to a chain reaction that ultimately caused big outages. From DynamoDB's experience, we can see that as long as you have a good backup plan for when dependencies crash, the impact on the main critical service will be limited.

Deployment

Deploying a giant distributed system is complicated. For a system of DynamoDB's scale, even the tiniest error can be magnified, so deployment must be approached with extreme caution. First and foremost, when deploying a new feature or patch, it's certainly not feasible to deploy to all machines simultaneously. Consider this: DynamoDB runs in over 100 data centers. If all machines deployed at once and the code had a bug, it would mean failure across more than 100 data centers, impacting thousands of customers. Therefore, DynamoDB must deploy in stages, so that any problem with new code can be quickly rolled back with minimal impact.
However, staged deployment also presents challenges. In a distributed system where some machines run new code while others are still on the old code, communication between them can become an issue. DynamoDB employs a read-write deployment strategy: if the way machines communicate changes, the code that can "understand" the new format is deployed first, followed by the code that "speaks" it. As an example: if the machines need to switch from communicating in English to Chinese, DynamoDB would first deploy code that lets machines understand Chinese, and only then deploy code that makes them speak Chinese. This ensures we never end up with some machines speaking Chinese while others still can't understand it.
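A toy example of that ordering, with a made-up message format: stage one ships a parser that understands both the old and the new shape, and only stage two starts emitting the new shape.
```python
# Toy "read-write" deployment: readers that understand both formats roll out first;
# writers switch to the new format only once every node runs the new reader.
def parse_message(raw: dict) -> dict:
    if raw.get("fmt") == "v2":                          # new format: nested payload
        return {"key": raw["payload"]["key"], "value": raw["payload"]["value"]}
    return {"key": raw["key"], "value": raw["value"]}   # old flat format

def emit_message_v1(key, value) -> dict:
    return {"fmt": "v1", "key": key, "value": value}

def emit_message_v2(key, value) -> dict:                # enabled only after all readers upgrade
    return {"fmt": "v2", "payload": {"key": key, "value": value}}

# During the rollout, both shapes are in flight and every node can parse both.
print(parse_message(emit_message_v1("x", 1)))
print(parse_message(emit_message_v2("x", 1)))
```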

🧐My Personal Opinion

DynamoDB has delivered continuously stable service over the past decade. However, I believe this journey certainly hasn't been as smooth as it looks, and many designs come from lessons and mistakes, such as the service outage in US-East in 2015. Because DynamoDB has a multi-partition architecture, the backend Paxos groups need to constantly pull the latest routing table for the entire database to learn the current partitioning scheme, such as whether a partition needs to split or merge with others. This table, in turn, was stored in DynamoDB itself, creating a circular dependency that can be very inflexible and fragile in extreme cases. And in 2015, it really did lead to a major outage. DynamoDB had just released the new GSI (Global Secondary Index) feature, and the routing tables expanded enormously. Each secondary index is independently partitioned, so for each additional GSI a customer set up, the routing tables would double in size. In the early hours of that day, due to an unstable network, many machines simultaneously attempted to pull the routing tables for their partitions. The routing tables were too large and there were too many concurrent requests, so many of them failed outright. Machines unable to get the latest routing information began to take themselves offline while retrying. And this is not the end of the problem: as mentioned earlier, if some machines in a Paxos group go offline, DynamoDB immediately starts new machines to join the group, and these new machines also try to pull the routing table. The offline machines kept retrying, more and more new machines came online and kept requesting, and eventually the entire routing service collapsed. Over half of DynamoDB customers' traffic was impacted.
I think these were the root causes:
  1. Each machine doesn't need to pull the entire routing table for every customer's table; it only needs to pull the data relevant to it.
  2. Storing the routing table in DynamoDB itself is too inflexible. Once an accident happens, it can cause a chain reaction of failures.
Of course, DynamoDB also did many things well, such as deploying in batches so that only one region was affected, and calculating availability every five minutes so that anomalies are detected immediately. After this incident, DynamoDB no longer stores its routing table in itself; instead it built a dedicated in-memory database and uses a Perkle Tree (unfortunately, I couldn't find any information about Perkle Trees online) to optimize query efficiency, allowing machines to efficiently query only the routing entries relevant to their own partitions. (As a side note, I personally think storing a database's metadata in the database itself is generally not a good idea; "outsourcing" it to another database is a more convenient and quick approach, such as how Snowflake stores metadata in FoundationDB: https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/)
This was DynamoDB's biggest incident in the past decade, lasting a total of three hours. Prior to this, DynamoDB had been operating with essentially 100% availability.
Apart from DynamoDB's successful operational experience, what impressed me most was how DynamoDB improved the developer experience. Using a database is never an easy task: programmers need to figure out the schema, tune various database parameters, calculate how much capacity the database needs, and so on. Many companies have to hire a dedicated DBA for this non-trivial work. DynamoDB, however, abstracts all of this away for you: it doesn't need a schema, it can auto-scale, and it doesn't need deployment. Using DynamoDB effectively reduces engineering complexity, giving users the best possible coding experience.
Finally, the 20-year evolution from Dynamo to DynamoDB also shows that most developers prefer DBaaS to databases they must run themselves. Databases are too important; almost every business in the world uses them. Databases are also too complex, with SQL parsing, query rewriting and optimization, scheduling, execution, storage, indexing, ACID transactions, WAL, and so on; each area could be a completely separate research topic. In an era where development speed matters enormously, most programmers don't have time to study every aspect of a database. Therefore, a good database should hide all the complex details and expose just the right amount of functionality to users. Moving to the cloud and becoming a managed service is an inevitable path.
For any database in the world, if users must dive deep into its source code and carefully study its architecture before using it, wouldn't that imply a bad database design?
 
 
In the end, shout out to Junxi, Youheng, and Haojia for giving me valuable improvement advice on this article.
 
 
Reference:
[1] Dynamo: Amazon’s Highly Available Key-value Store https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[2] Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service https://www.usenix.org/system/files/atc22-elhemali.pdf
[3] Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region https://aws.amazon.com/message/5467D2/
[4] Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html