Querying Azure Storage

I’ve been reading the Wikipedia article about CouchDB lately.  It always impresses me how much engineering effort and how many man-hours are put into so many competing platforms.  I guess the least we can do is look at the architecture of those initiatives and get the best out of them.

CouchDB is a NoSQL database.  It basically stores JSON documents on potentially many commodity servers in order to scale horizontally.  It supports replication, transactions, a RESTful API, etc.  But the bit of the architecture that piqued my curiosity was the view engine.

CouchDB sports views on document collections.  Those views are computed by a map-reduce algorithm using server-side JavaScript functions that you supply, and they are computed in parallel on the servers.
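To make the idea concrete, here is a minimal sketch of what a map-reduce view boils down to.  It is plain Python running in memory, not CouchDB code (real CouchDB view functions are written in JavaScript and run on the server), and the documents, field names and the `build_view` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical documents; the field names are invented for this example.
docs = [
    {"_id": "1", "type": "order", "customer": "alice", "total": 30},
    {"_id": "2", "type": "order", "customer": "bob", "total": 12},
    {"_id": "3", "type": "order", "customer": "alice", "total": 8},
    {"_id": "4", "type": "note", "text": "not an order"},
]


def map_order(doc):
    """Map step: emit (key, value) pairs for the documents the view cares about."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]


def reduce_sum(values):
    """Reduce step: collapse all the values that share a key."""
    return sum(values)


def build_view(documents, map_fn, reduce_fn):
    """Run the map step over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Keep the result sorted by key, the way an ordered index would be.
    return {key: reduce_fn(grouped[key]) for key in sorted(grouped)}


totals_by_customer = build_view(docs, map_order, reduce_sum)
print(totals_by_customer)
```

The point is that the map function decides which documents take part and under which key, so the resulting view behaves like a precomputed, queryable summary of the collection.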

Well, that’s nice.  I don’t know too much about the reliability implications of having arbitrary code used to compute a view, but I guess you can do as much damage in an SQL view if you put your mind to it.

What I like about that architecture component is that it does respond to a need when you store data:  querying it!  The big question mark for me around Azure Storage has been the querying.  Take the Azure tables:  you have a partition key and a row key.  If you know those, the engine can return you the rows you want in the blink of an eye.  If you don’t but you know the values of other properties (columns), then the engine will do a table scan on all the scaled-out servers.  That sounds a bit extreme to me.
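The difference between the two access paths can be sketched with a toy in-memory model.  This is not the Azure Storage API; the `orders` dictionary and both helper functions are assumptions made purely to show why knowing the keys matters.

```python
# Toy model of a table: each row is addressed by (partition key, row key).
# The rows and property names are invented for this example.
orders = {
    ("alice", "order-001"): {"city": "Montreal", "total": 30},
    ("alice", "order-002"): {"city": "Montreal", "total": 8},
    ("bob", "order-001"): {"city": "Toronto", "total": 12},
    ("carol", "order-001"): {"city": "Montreal", "total": 55},
}


def point_lookup(table, partition_key, row_key):
    """With both keys known, the row is found directly."""
    return table.get((partition_key, row_key))


def scan_by_property(table, name, value):
    """Without the keys, every row has to be inspected."""
    inspected = 0
    matches = []
    for key, row in table.items():
        inspected += 1
        if row.get(name) == value:
            matches.append(key)
    return matches, inspected


row = point_lookup(orders, "bob", "order-001")
matches, inspected = scan_by_property(orders, "city", "Montreal")
print(row)
print(len(matches), "matches after inspecting", inspected, "rows")
```

In the toy model the scan touches every row regardless of how many match, which is the behaviour that worries me once the rows are spread over many servers.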

I understand the principles behind it:  the Azure tables don’t have a schema, the partition key allows you to hint to the engine how to partition your data, etc.  But it seems very low-level to me.  Once I’ve “designed” my table, i.e. chosen the partition & row keys, I’m locked in that way.  This is where something like the CouchDB view engine would come in handy:  something like an index the engine could use to resolve queries thrown at it.

Actually the CouchDB solution leaves a bit to be desired.  It forces you to know which view to use.  That sounds like the old days of indexes where you had to specify which index to use.

That’s why I would prefer to have an indexing engine sitting next to the table that you could configure separately.
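Here is a rough sketch of what I have in mind, again as an in-memory toy and not as anything Azure Storage offers.  The `IndexedTable` class, its methods and the sample rows are all hypothetical.  The caller never names an index; the table picks one if it has been configured and falls back to a scan otherwise.

```python
from collections import defaultdict


class IndexedTable:
    """Toy table with optional secondary indexes configured separately."""

    def __init__(self):
        self.rows = {}     # (partition key, row key) -> row
        self.indexes = {}  # property name -> {value -> set of keys}

    def add_index(self, prop):
        """Configure an index after the fact, building it from existing rows."""
        index = defaultdict(set)
        for key, row in self.rows.items():
            if prop in row:
                index[row[prop]].add(key)
        self.indexes[prop] = index

    def insert(self, partition_key, row_key, row):
        """Insert or overwrite a row and keep every index in step with it."""
        key = (partition_key, row_key)
        old = self.rows.get(key)
        if old is not None:
            for prop, index in self.indexes.items():
                if prop in old:
                    index[old[prop]].discard(key)
        self.rows[key] = row
        for prop, index in self.indexes.items():
            if prop in row:
                index[row[prop]].add(key)

    def query(self, prop, value):
        """Resolve a query, choosing the access path on the caller's behalf."""
        if prop in self.indexes:
            return sorted(self.indexes[prop].get(value, set())), "index"
        keys = [k for k, r in self.rows.items() if r.get(prop) == value]
        return sorted(keys), "scan"


orders = IndexedTable()
orders.insert("alice", "order-001", {"city": "Montreal"})
orders.insert("bob", "order-001", {"city": "Toronto"})

before = orders.query("city", "Montreal")  # no index yet, so a scan
orders.add_index("city")
after = orders.query("city", "Montreal")   # same query, now served by the index
print(before, after)
```

The same query text works before and after the index exists, which is the part I miss in the “pick your view” model.  Keeping such an index consistent across many servers is of course the hard part, and this sketch ignores it entirely.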

Now, I know that building a distributed index on the scale of Azure Storage (currently 100 TB maximum) must be a daunting task with more than one challenge.  But ultimately, the current version of Azure Storage is a bit too “write-only” to be useful in a great many scenarios.

I believe that is part of the storage schizophrenia there is at the moment within Windows Azure, i.e. SQL Azure versus Azure Storage.  I couldn’t quite put my finger on how to fix Azure Storage and still don’t have a solution, but if an open-source project like CouchDB can do it, I’m sure Microsoft can pull it off!
