This article is part of a series. Click here to see a summary and the complete list of articles in the series.
In building a service based on Windows Azure Table Store I see three classes of mistakes I can make that are likely going to make me wish I had some kind of backup/journaling for my Windows Azure Table Store:
- Deleting my own tables in production
- Screwing up a schema change
- Screwing up my application logic
1 D’oh! Deleting myself
The Windows Azure Table Service has a nice REST API that includes a DELETE method that can be applied to an entire table. In other words, in a single REST command I can nuke a production table. I’m only one misconfigured maintenance script away from severely hurting myself. I suppose one can argue that running an Internet service is a job for adults and that adults shouldn’t have problems like accidentally deleting production tables, but hey, I’m basically a big scaredy cat and I’d like a bit more cover.
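To make concrete just how little it takes, here’s a rough sketch in Python of that one-command nuke. The account name and table name are placeholders, the key comes from an environment variable, and while the SharedKeyLite signing follows the documented scheme for the Table service, treat the whole thing as illustrative rather than production code:

```python
# Everything needed to destroy a production table: one signed HTTP DELETE.
# ACCOUNT/TABLE are placeholders; the key comes from an environment variable.
import base64
import hashlib
import hmac
import os
from email.utils import formatdate

import requests

ACCOUNT = "myaccount"   # placeholder storage account
TABLE = "Customers"     # the table about to disappear
KEY = base64.b64decode(os.environ["AZURE_ACCOUNT_KEY"])

date = formatdate(usegmt=True)
# SharedKeyLite for the Table service signs just the date and the resource.
resource = f"/{ACCOUNT}/Tables('{TABLE}')"
signature = base64.b64encode(
    hmac.new(KEY, f"{date}\n{resource}".encode(), hashlib.sha256).digest()
).decode()

resp = requests.delete(
    f"https://{ACCOUNT}.table.core.windows.net/Tables('{TABLE}')",
    headers={
        "Date": date,
        "x-ms-version": "2009-09-19",
        "Authorization": f"SharedKeyLite {ACCOUNT}:{signature}",
    },
)
resp.raise_for_status()  # 204 No Content, and the table is marked for deletion
```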
Here is where my first request to Azure comes in. Could we please have undelete? The webpage I previously linked to says:
When a table is successfully deleted, it is immediately marked for deletion and is no longer accessible to clients. The table is later removed from the Table service during garbage collection.
So perhaps we could let folks set a policy specifying how long a table is guaranteed to stick around before being deleted and then add in an undelete method?
While I’m asking for things: currently, all access to the table store is handled via a single key. So any part of my service that needs access to the table store has a key that lets it do anything, including things it has no business doing, like deleting tables. Again, adults should run services securely, and although I can grumble about some defense-in-depth issues, this single key shouldn’t really matter for data integrity. After all, why should there be any code running around issuing DELETEs if it doesn’t need to? Still, see my previous comment about being a big scaredy cat: I wouldn’t mind some more fine-grained access control so that the parts of my service that need to interact with the store could only do the things they were supposed to do. (And yes, I know Blobs have basic ACLs, but I’m talking about the table service.)
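In the meantime the best I can do is fake least privilege inside my own code. Here’s a minimal sketch of the idea; the inner client and the method names are stand-ins for whatever storage library you actually use:

```python
# A least-privilege facade: each component gets only the operations it needs.
# "inner" stands in for a real table store client; method names are assumed.
class ScopedTableClient:
    def __init__(self, inner, allowed):
        self._inner = inner
        self._allowed = frozenset(allowed)

    def __getattr__(self, name):
        if name not in self._allowed:
            raise PermissionError(f"{name!r} is not permitted for this component")
        return getattr(self._inner, name)

# The front end can read and write entities but can never delete a table:
# front_end_client = ScopedTableClient(
#     real_client, {"insert_entity", "update_entity", "query_entities"})
```

It’s no substitute for real access control enforced by the service, since anyone holding the key can still bypass the wrapper, but it at least keeps an honest maintenance script from issuing a DELETE it never meant to.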
But who’s kidding who? Given that I spent most of the last year working on the AppFabric Access Control Service of course I’m going to whine about Access Control.
In any case, for now, the situation is that if I do something stupid I can seriously hurt myself. Now, admittedly, that’s always true, but I really have no objection to a few safety measures if Azure sees fit to introduce them. Until then I wouldn’t mind some kind of backup to keep me from completely screwing myself if I accidentally nuke a table.
2 Data Migration Failure
When laying out tables in Azure Table Store one makes lots of fun trade-offs between things like referential integrity and performance. Over time the world behind those trade-offs will change and the tables will need to be redesigned. But that introduces a really rich source of screw ups. Any time data gets moved from one form to another, especially when large bodies of data are involved, a screw up is all but guaranteed. So I’m really going to want some kind of backup so that when I get the inevitable user escalation about data corruption I can at least go back to where I was before I got into this mess and maybe bring the user back to some reasonable state.
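To make that concrete, here’s an entirely hypothetical sketch of the kind of migration that bites: splitting a single Name property into FirstName/LastName during a table redesign. scan_entities and insert_entity are stand-ins for real storage calls:

```python
# A migration that looks fine in testing and corrupts data at scale.
# scan_entities/insert_entity are hypothetical stand-ins for storage calls.
def transform(entity):
    first, _, last = entity["Name"].partition(" ")
    entity["FirstName"], entity["LastName"] = first, last
    del entity["Name"]
    return entity

def migrate(scan_entities, insert_entity):
    for entity in scan_entities("Customers"):
        # "Cher" ends up with an empty LastName, "Anne Marie Smith" gets
        # "Marie Smith" as a last name, and if this loop dies halfway the
        # data is half old-shape, half new-shape.
        insert_entity("CustomersV2", transform(entity))
```

A pre-migration snapshot is the only thing that makes any of those outcomes recoverable, which is exactly the backup I’m asking for.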
3 Application Logic Failure
My service receives a command. Based on that command my service performs some series of actions on our underlying Windows Azure Table Store. All is fine and good so long as we don’t screw anything up. Screw ups tend to come in a few basic flavors:
- We delete something we shouldn’t have
- We didn’t delete something we should have
- We transformed the state of the right row in the wrong way
- We transformed the wrong row
To be fair, when building a service, the bulk of the testing is focused on detecting any logic screw ups that could lead to the previous failures. But any non-trivial Internet-scale service is going to deal with an enormous variety of data input, and it’s highly unlikely that our tests could ever catch everything. As we say in the Internet business, “When your data set is large enough there are no edge cases.”
So when we figure out that we have a data corruption bug, what do we do? How do we know who was affected? What can we repair ourselves? Do we just throw up our hands and tell our users “Oh, um... well... you see... we have a problem and you, dear user, are on your own”?
At an absolute minimum I would like to have a command journal that records every command issued against the system. In my ideal world I would journal data retrieval as well as manipulation, but in practical terms I can probably only afford to journal commands that change data. If I could build this journal then when I find out about a data corruption bug caused by my front end I could at least try to figure out which of my users were likely to be affected by reviewing the journal, looking for commands (or combinations of commands) that would trigger the bug. It’s not great protection but at least it gives me some chance of giving my users guidance when I screw up.
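Here’s a rough sketch of the shape I have in mind; journal_append is a stand-in for any durable append-only store (a blob, a journal table, whatever). The one property that matters is recording the command before the table store is touched:

```python
# Journal-then-apply: every state-changing command is durably recorded
# before it mutates the store, so bug hunts can later inspect the history.
import json
import time
import uuid

def execute(command, journal_append, apply_command):
    record = {
        "id": str(uuid.uuid4()),
        "at": time.time(),
        "command": command,  # e.g. {"op": "update", "partition": ..., "row": ...}
    }
    journal_append(json.dumps(record))  # record first...
    apply_command(command)              # ...then change the data
    return record["id"]
```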
That all makes total sense to me Yaron. Ideally there would also be some sort of journal replay mechanism that would enable me to re-run all of my CUD operations from a given point in time for a given partition/row.
-jamie
Isn’t this true as a justification for all backups, and not just Windows Azure Table Storage?
One interesting thing to notice is that the Azure folks guarantee that the data will be written thrice in the same data center (and at a later date they will provide a geo-replication capability). What would be interesting to know is: when we delete a particular table, do they mark all three locations for garbage collection? A nice idea would be to provide a recovery mechanism from one of the other two places where my data is written.
Thanks
Gaurav