Creating a Backup Solution for Cosmos DB using Change Feed
Cosmos DB is Azure's native NoSQL database. It has awesome capabilities such as global distribution, exceptionally high availability, throughput scalability, and much, much more.
As with most horizontally scalable NoSQL databases, it doesn't have the same backup capabilities as mainstream RDBMS systems.
Cosmos DB has an automated backup capability. It is always there and doesn’t affect performance. It has a few weaknesses though:
- In order to restore, we need to contact support; there is no API to restore a backup
- We need to contact support within 8 hours, as retention is very short
- We can't go back to an arbitrary point in time
In order to have more control over backups and restores, the online documentation recommends using Azure Data Factory and / or the Azure Cosmos DB change feed.
In this article we are going to look at how we could use the change feed for backups and when it would work and when it wouldn’t.
Change feed 101
The online documentation does a good job of explaining the change feed, so we won't duplicate that here.
Change feed Pros & Cons for backups
Let’s look at the pros and cons of the change feed with the lens of backup.
- Time order (per partition)
- Captures both creation and updates of documents
- Captures full state of documents
Time order means we do not need to sort by timestamp. The fact that it captures the full state of a document greatly simplifies the process: we do not need to reconstruct a document from fragments accumulated over time.
This is a great start for a backup solution. Let’s now look at the cons.
In our perspective, there are three broad categories of cons:
- Deleted documents
  - Doesn't capture deleted documents
  - Time to live needs to be tracked at both document and collection level
- No snapshot
  - Only the latest change is kept
- Only covers documents
  - Not collection settings (e.g. default time to live)
  - Not stored procedures
  - Not functions
  - Not triggers
Deleted documents are probably the biggest issue. It is recommended in general (not just in the context of backup) to either:
- Soft delete documents; e.g. have an isDeleted field in documents and filter against it in queries
- Use time-to-live at the document level, as this is just a document field and will be caught by the change feed
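To make the soft-delete recommendation concrete, here is a minimal in-memory sketch. The `soft_delete` and `live_documents` names and the `isDeleted` / `deletedAt` fields are our own illustration (only `isDeleted` comes from the text above); in a real solution the filter would live in every Cosmos DB query.

```python
from datetime import datetime, timezone

def soft_delete(doc: dict) -> dict:
    """Mark a document as deleted instead of removing it.

    The update is an ordinary write, so it flows through the change
    feed and a backup process can record the deletion."""
    doc["isDeleted"] = True
    doc["deletedAt"] = datetime.now(timezone.utc).isoformat()
    return doc

def live_documents(docs):
    """The filter every query would apply to hide soft-deleted docs."""
    return [d for d in docs if not d.get("isDeleted", False)]

docs = [{"id": "1"}, soft_delete({"id": "2"})]
print([d["id"] for d in live_documents(docs)])  # ['1']
```

The important point is that nothing is ever physically removed, so the change feed always has something to replay.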
This is a good recommendation but could mean substantial changes to the logic of a solution. For this reason, it is likely the biggest show stopper.
There is a lot to be told about not deleting records as a design principle. Greg Young makes a great argument for it using CQRS and Event Sourcing. Unfortunately, in most solutions records do get deleted.
Time to live can be set at the collection level, the document level, or both, in which case the document-level time to live wins (and a document-level value only applies if time to live is enabled on the collection at all). This makes it non-trivial to predict the deletion of a document in the context of a backup.
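The precedence rules can be captured in a small helper. This is a sketch of the semantics as we understand them from the documentation (collection TTL off means document TTL is ignored; -1 means "on, but no expiry by default"); the function name is ours.

```python
def effective_ttl(collection_default_ttl, document_ttl):
    """Seconds after which a document expires, or None if it never does.

    collection_default_ttl: None (TTL off), -1 (on, no default expiry),
                            or a positive number of seconds.
    document_ttl: None (inherit), -1 (never expire), or seconds."""
    if collection_default_ttl is None:
        return None  # TTL disabled on the collection: document ttl is ignored
    if document_ttl is not None:
        return None if document_ttl == -1 else document_ttl
    return None if collection_default_ttl == -1 else collection_default_ttl

# effective_ttl(None, 60)  -> None   (collection TTL off wins)
# effective_ttl(-1, 60)    -> 60     (document value applies)
# effective_ttl(3600, -1)  -> None   (document opts out of expiry)
```

A backup process would need to evaluate this for every document to predict which ones will silently disappear.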
Consistency is a subtler problem. We will analyze it in detail in the next section.
Finally, the fact that only documents are in the change feed isn’t a show stopper. But it does add custom logic we would need to write to have a complete backup solution.
When would a backup be inconsistent?
Let’s tackle the consistency problem. As stated in the previous section, the change feed has two limitations in that regard. First, there is no snapshot: while we consume the change feed, new changes keep modifying it. Second, only the latest change per document is kept in the change feed. So, if we create a document and update it twice, only the second update will appear.
This can very easily lead to inconsistency. Let’s remember the goal of a backup: to take a consistent copy of a collection. Data should be consistent at least within a partition.
Let’s look at the following example. Here we consider only one document, on which six updates are performed: A, B, C, D, E & F. We also take two backups. During each backup we read the change feed from the last backup until “now”.
We establish “now” as the time at the beginning of the backup. This way, even if there is a lot of activity during the backup, we will eventually catch up with the beginning of the backup and stop.
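A backup pass with this fixed “now” cutoff can be sketched as follows. The function and parameter names (`backup_pass`, `read_change_feed`, `write_to_backup`) are hypothetical; the sketch assumes the feed is returned in time order, which the change feed guarantees per partition.

```python
import time

def backup_pass(read_change_feed, write_to_backup, last_checkpoint):
    """One backup pass over a time-ordered change feed.

    'now' is fixed at the start of the pass, so even under heavy write
    activity the pass catches up to its own start time and stops."""
    now = time.time()
    for ts, doc in read_change_feed(since=last_checkpoint):
        if ts >= now:
            break  # this change happened after the backup started
        write_to_backup(doc)
    return now  # becomes the checkpoint for the next pass
```

For example, with a feed containing two past changes and one future one, only the two past changes land in the backup; the returned checkpoint is where the next pass resumes.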
We see that Update C occurs during the first backup. This leads to a race condition for what is captured in the backup:
- If Update B was captured by the backup (from the change feed) before Update C occurred, then B is the captured state
- If Update C occurred before the backup captured Update B, then Update C appears in the change feed instead
  - Its timestamp is after the backup start, so we won't capture it
This gets worse as backups compound. Let’s assume that Update B was captured by the first backup, and let’s look at the second backup.
Now let’s assume that Update F occurs before we can read Update E from the change feed. Since Update F’s timestamp is later than the start of Backup 2, it isn’t captured. From the backup’s perspective, the document is therefore still in the state of Update B. That is the state prior to Backup 1!
This is because the intermediary updates are erased from the change feed as new ones occur.
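The mechanism can be demonstrated with a few lines. This sketch models the feed as a dict keyed by document id, which is all the “latest only” semantics amounts to; the names are ours.

```python
# The change feed keeps only the latest change per document id.
feed = {}

def record_change(doc_id, state, ts):
    feed[doc_id] = (state, ts)  # a newer update overwrites the older entry

backup_start = 150

record_change("doc1", "E", 100)  # E happens before the backup starts...
record_change("doc1", "F", 200)  # ...but F overwrites it before we read the feed

state, ts = feed["doc1"]
captured = state if ts < backup_start else None
print(captured)  # None: E was erased, and F is past the backup cutoff
```

Update E was a perfectly valid pre-backup state, yet the backup never sees it: the only surviving entry is F, which falls outside the cutoff.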
We can see that this could easily lead to inconsistent state in the backup. Different documents could be captured with different lag. Together they would form an inconsistent picture of what the collection looked like at the beginning of the backup.
For those reasons, the change feed might look like a very poor foundation for a backup solution. Before we discard it altogether, let’s look at scenarios where the change feed leads to a consistent image of a collection.
When would a backup be consistent?
Here Backup 1 captures Update A which is the state of the collection at the beginning of Backup 1. Similarly, Backup 2 captures nothing on the same document since Update B occurs after the backup started. Again, this is consistent with the state of the collection at the beginning of Backup 2.
Why does it work now?
Basically, we want to have the picture above: updates occurring every other backup.
We want two metrics to be low:
- Document change rate (i.e. # of changes on one document per hour)
- Backup duration (i.e. the time it takes for us to consume the change feed since the last backup)
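Both metrics can be estimated from the feed itself, since each entry carries the document's `_ts` system property (epoch seconds). The helper names and the "rate exceeds backup cadence" risk heuristic are our own sketch, not an official formula.

```python
from collections import defaultdict

def change_rates_per_hour(changes, window_hours):
    """Per-document change rate over an observation window.

    changes: iterable of (doc_id, ts) pairs, e.g. built from each
    change-feed entry's _ts system property (epoch seconds)."""
    counts = defaultdict(int)
    for doc_id, _ in changes:
        counts[doc_id] += 1
    return {doc_id: n / window_hours for doc_id, n in counts.items()}

def at_risk(rates, backups_per_hour):
    """Documents changing faster than the backup cadence: their
    intermediate states can be erased from the feed between backups."""
    return {doc_id for doc_id, rate in rates.items() if rate > backups_per_hour}

changes = [("a", 1), ("a", 2), ("a", 3), ("b", 1)]
rates = change_rates_per_hour(changes, window_hours=1)
print(at_risk(rates, backups_per_hour=2))  # {'a'}
```

Document "a" changes three times per hour against two backups per hour, so it is the one whose backed-up state may lag.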
In extreme cases:
- If documents never changed once created, we wouldn't have conflicts
- If backups were instantaneous, we would essentially have a snapshot of the collection at the beginning of each backup
Those two extreme cases help us understand the dynamic but aren’t realistic. Instantaneous backups aren’t possible due to inherent latency, and a collection of never-changing documents is basically a backup in itself.
We are left with trying to shorten the backup duration and / or to run backups at a faster pace than document changes.
Those two goals aren’t mutually exclusive. On the contrary, frequent backups lead to a shorter change feed, which in turn means a shorter backup.
In order to have consistent backups, we need to take frequent, quick backups.
This would be very hard for a collection experiencing a high rate of document changes, e.g. IoT telemetry. We would argue that backup solutions are likely to fail in those scenarios anyway: the only thing fast enough to absorb a firehose-type data stream is Cosmos DB itself.
Using change feed for implementing a custom backup solution for Azure Cosmos DB is possible. But it isn’t trivial.
First, we need to take care of deleted documents ourselves, as the change feed doesn’t capture them.
Second, we need to snapshot collection settings, stored procedures, functions & triggers separately.
Finally, we need to take backups quickly (i.e. no fancy processing) and frequently.
Although those considerations limit the range of scenarios, it still leaves a lot of scenarios where it makes sense.