I had previously concluded that command journaling (keeping a journal of all the external user commands and internal maintenance commands I issue) is really useful for recovering from self-inflicted data corruption. In this article I look at the various techniques I can use to implement a command journal and how each one trades off system performance against the journal’s utility in recovery.
This article is part of a series. Click here to see a summary and the complete list of articles in the series.
Contents
1 Command Journal? Isn’t that just a log?
2 Performance versus accuracy
2.1 Single parallel write
2.2 Single serial write
2.3 Adding a second write
2.4 Adding command IDs and versioning
3 Should I include reads in the command journal?
4 Retrieving data from the command journal
1 Command Journal? Isn’t that just a log?
You betcha. All I’m suggesting is that we can use those logs to help recover from self-inflicted data corruption. In this article I talk about specific ways of generating that log that make it more useful for data corruption recovery.
2 Performance versus accuracy
In deciding how to design the command journal I basically have to trade off between the performance overhead of the command journal and accuracy of its contents. Below I walk through a number of designs/features that set the performance/accuracy slider to different settings.
2.1 Single parallel write
The most trivial command journal I can imagine would work in the following way: A request is received and an immediate asynchronous write request is made to the command journal to record the request. Meanwhile, in parallel, the actual command is executed.
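To make that concrete, below is a minimal sketch of the parallel approach in Python. The write_journal_entry and execute_command functions are hypothetical stand-ins for whatever journal store client and command handler the real service would use; the important part is that nothing waits on the journal write.

    import concurrent.futures
    import time
    import uuid

    # Hypothetical stand-ins for the real journal store and command handler.
    def write_journal_entry(entry):
        ...  # e.g. an insert into a table or blob store

    def execute_command(command):
        ...  # the actual business logic

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def handle_request(command):
        entry = {"id": str(uuid.uuid4()),
                 "received_at": time.time(),
                 "command": command}
        # Fire and forget: the journal write runs in parallel with the command
        # and nothing waits on it, so a failed write is silently lost.
        executor.submit(write_journal_entry, entry)
        return execute_command(command)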
The main downside to this approach is false negatives, i.e. saying someone wasn’t affected by a bug who really was. It’s quite possible that something happened to the write to the command journal but the command was still executed. In that case there would be no record of the command in the journal, so if the lost record was for a command that hit the bug we wouldn’t know to notify the affected entity.
2.2 Single serial write
I’m generally more concerned with false negatives (telling people they are o.k. when they aren’t) than with false positives (telling someone a bug hit them when it didn’t). My guess is that in any real world scenario where I’m trying to recover from a self-inflicted bug, the people affected are going to have to check whether the ‘fix’ makes sense, and that will give them a chance to realize they weren’t really affected by the bug in the first place.
One way to reduce my false negative rate is to serialize the front end interaction with the command journal. When a command comes in I would first issue a write to the command journal and only once I got confirmation that the write succeeded would I then proceed to process the rest of the command.
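A sketch of the serial variant, reusing the hypothetical write_journal_entry and execute_command helpers from the previous snippet; the only change is that the journal write is confirmed before the command is allowed to run.

    def handle_request_serial(command):
        entry = {"id": str(uuid.uuid4()),
                 "received_at": time.time(),
                 "command": command}
        # Block until the journal confirms the write; only then run the command.
        # If this write fails the command never executes, so the journal can
        # over-report (false positives) but not under-report (false negatives).
        write_journal_entry(entry)
        return execute_command(command)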
This approach is more expensive than the previous one in terms of time. The time to process all write commands will now increase by the round trip time needed to write to the command journal. But on the positive side I’ll have far fewer false negatives since I’m guaranteed that a record of the command exists. Of course I’ll also have false positives since the command may exist in the journal but never have been executed in real life.
2.3 Adding a second write
One way to reduce false positives is to add a second write where we record the outcome of the operation. This will only help in cases where a bug is directly related to how the command completed (e.g. if a bug only shows up in operations that succeed or only ones that fail in a certain way). By recording data about how a command completed we can then filter down the list of affected users.
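Roughly, the two-write version looks like this, again using the hypothetical helpers from the earlier sketches. The second write records how the command actually completed.

    def handle_request_with_outcome(command):
        entry_id = str(uuid.uuid4())
        write_journal_entry({"id": entry_id, "phase": "received",
                             "received_at": time.time(), "command": command})
        try:
            result = execute_command(command)
            outcome = {"status": "succeeded"}
            return result
        except Exception as exc:
            outcome = {"status": "failed", "error": repr(exc)}
            raise
        finally:
            # Second write: record how the command completed. If the machine
            # dies before this runs we degrade to the single-write case.
            write_journal_entry({"id": entry_id, "phase": "completed",
                                 "completed_at": time.time(), **outcome})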
But this mechanism, besides adding another round trip and additional expense, is far from perfect. After all, a machine could simply drop dead in the middle of processing a command, and unless a mechanism like Azure’s queues is being used there will be no recovery and we essentially end up back at a single-write system as far as that command is concerned.
Still, this approach won’t increase false negatives over the single-write designs (i.e. it does no harm) and it can reduce false positives.
2.4 Adding command IDs and versioning
Another technique to reduce false positives is to assign an ID to each command as it comes in. That ID is recorded in the command journal and every write we do to our production data systems will also include that ID. When we investigate the outcome of a command we can look for the ID in our production storage and piece together what actually happened to the command.
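A sketch of what that tagging might look like. The rows_to_write and production_store_write functions are hypothetical placeholders for however the real system turns a command into writes against production storage.

    def handle_request_with_id(command):
        command_id = str(uuid.uuid4())
        write_journal_entry({"id": command_id,
                             "received_at": time.time(),
                             "command": command})
        # Every production write carries the command ID, so later we can scan
        # production storage for the ID to see whether the command really landed.
        for row in rows_to_write(command):
            row["last_command_id"] = command_id
            production_store_write(row)
        return command_id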
In the general case we will need to know not just if the write happened but also what value it wrote (for diagnosing intermittent bugs). So if our production stores aren’t versioned then the IDs won’t be terribly useful. I hope to get to an article on how to implement versioning on top of Windows Azure Table Store.
3 Should I include reads in the command journal?
In theory I should record everything. Every read. Every write. It’s tempting to argue that the reads, in particular, don’t matter. After all I am worried about cases where I am corrupting data and reads shouldn’t cause data to change. Of course, code shouldn’t have bugs and I shouldn’t have to worry about corrupted data either. But in reality code does have bugs and my reads could have potentially done something stupid. Or more likely, caused something stupid to happen.
The simplest example is that there is a bug in the code that reads out values and every once in a while I return the wrong value. If I logged both reads and writes I could potentially figure out which reads were likely to have run into the bug and let specific customers know if they could have been affected.
If I don’t log reads, then I can’t provide extra notice to the users most likely to have been affected. Of course I also have to pay the bills, and while cold data isn’t terribly expensive it isn’t free. For example, Azure has stated they are going to charge $0.15/GB stored per month and $0.01 per 10,000 storage transactions. Assuming I’m using Windows Azure compute as my front end, I don’t have to pay anything to move the journal commands from my front end machines to storage.
Let’s say I get 1000 read requests/second. That works out to 1000 * 60 * 60 * $0.01/10,000 = $3.60/hour in transaction costs for writing to the journal (assuming a single-write model). Let’s further assume that every read requires 2048 bytes of journal space; I’m only recording the fact that the read happened, not the response body, hence the 2048 byte guesstimate. In that case it costs 1000 * 60 * 60 * 2048 / 1024 / 1024 / 1024 * $0.15 = $1.03 to store one hour’s worth of journal data for one month. Let’s assume I keep the data for six months, so the storage cost accrued for each hour of journaling is $1.03 * 6 = $6.18. Running for a single day therefore costs ($3.60 + $6.18) * 24 = $234.72, and that covers the full cost of keeping all of that day’s data stored for six months.
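For anyone who wants to check the arithmetic, here is the same back-of-the-envelope calculation as a small Python snippet; the prices and sizes are just the assumptions stated above.

    # Back-of-the-envelope cost of journaling reads, using the prices quoted above.
    reads_per_second = 1000
    entry_bytes      = 2048
    txn_price        = 0.01 / 10000   # $ per storage transaction
    storage_price_gb = 0.15           # $ per GB stored per month
    retention_months = 6

    reads_per_hour  = reads_per_second * 60 * 60                          # 3,600,000
    txn_cost_hour   = reads_per_hour * txn_price                          # $3.60
    gb_per_hour     = reads_per_hour * entry_bytes / 1024 ** 3            # ~6.87 GB
    storage_cost_hr = gb_per_hour * storage_price_gb * retention_months   # ~$6.18
    daily_cost      = (txn_cost_hour + storage_cost_hr) * 24              # ~$234.72
    print(round(txn_cost_hour, 2), round(storage_cost_hr, 2), round(daily_cost, 2))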
Which actually isn’t too bad all things considered.
So maybe I can just journal everything?
4 Retrieving data from the command journal
Writing to the command journal is reasonably straightforward. Interrogating the command journal isn’t necessarily as easy. The amount of data in the command journal can very quickly grow to stupendous heights. The previous example of journaling reads was writing out 6.87 gigs or so an hour. That means if we are storing six months’ worth of data then our journal will have a size in the neighborhood (just for reads!) of 6.87 * 24 * 30 * 6 = 29678.4 gigs, or roughly 29 terabytes.
This is pretty typical for this kind of usage log. When it’s time to grub through the journal to find people affected by a bug we need a search platform. Even using SQL Azure won’t magically solve the problem for us: since the largest size for a database in SQL Azure is currently 10 gigs, we’d need a lot of databases and a framework to query and aggregate across those databases. And in Windows Azure Table Store we could issue a single query that would go through the entire command journal, but it would take so long to run that we would all but certainly need to break the query up into sections (say by partition key) and then combine the results together. So in both cases we need a way to fan out sub-queries and then fan the results back in. This is of course the classic Map/Reduce pattern.
This is a pretty trivial version of map/reduce. Essentially we need to partition the search space and then, to use a trivial example, fire off all the sub-queries into a Windows Azure queue and have different Windows Azure compute instances pull from the queue until it’s empty, writing out their results to some result store.
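Here is a rough sketch of that fan-out/fan-in shape in Python. The queue, journal_store and result_store objects are hypothetical abstractions rather than real Windows Azure SDK calls; the point is just the pattern, and a real worker would also need to handle retries and concurrent consumers.

    # Fan out: enqueue one sub-query per journal partition.
    def fan_out(partition_keys, query, queue):
        for key in partition_keys:
            queue.put({"partition": key, "query": query})

    # Worker: each compute instance drains the queue, scanning its assigned
    # partition of the journal and appending any matches to the result store.
    def worker(queue, journal_store, result_store):
        while not queue.empty():
            job = queue.get()
            matches = journal_store.scan(job["partition"], job["query"])
            result_store.append(job["partition"], matches)

    # Fan in: once the queue is drained, merge the per-partition results.
    def fan_in(partition_keys, result_store):
        combined = []
        for key in partition_keys:
            combined.extend(result_store.read(key))
        return combined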
Still, it would be nice if we could use an engine like Hadoop or Dryad to handle all the dirty work for us. Or better yet, maybe Microsoft could make Cosmos available as a public service.
Yaron,
I know you worked on Cosmos and I’ve read your blog on it but I admit I don’t know much about it. How does it differ from Dryad?
-Jamie
Cosmos uses Dryad as its node scheduling system.