Azure Data Lake - Early look
UPDATE (19-01-2016): Have a look at Azure Data Lake series for more posts on Azure Data Lake.
Ok, this is a super early look at the technology. Azure Data Lake was announced yesterday (September 29th, 2015) at AzureCon (and later blogged about by Scott Gu). It will enter public preview at the end of the year, so there isn’t a ton of documentation about it.
But quite a few characteristics have already been unveiled.
What is a data lake?
I first came across the concept of a data lake reading Gartner's incoming-trends reports. A data lake is your unprocessed / uncleansed (raw) data. The idea is that instead of having your data stored neatly in a data warehouse, where you’ve cleansed it and probably removed a lot of information from it by keeping only what’s necessary for your foreseen analytics, a data lake holds your raw data: not trivial to work with, but it contains all the information.
Actually, there are 2 distinct services here:
- Azure Data Lake Store
- Azure Data Lake Analytics
The two are sort of loosely coupled although well integrated.
Azure Data Lake Store is a storage service, a sort of alternative to Blob Storage. It features huge storage scale (there aren’t any advertised capacity limits), low latency for real-time workloads (e.g. IoT) and supports any type of data, i.e. unstructured, semi-structured and structured. At this point, it isn’t clear if it’s just a massive blob storage storing files only or if you can really store structured data natively (as with Azure Tables). On the other hand, the store implements HDFS, which is a file system… so I guess the native format is files.
Azure Data Lake Analytics, on the other hand, is an analytics service. So far, it seems that its primary interface is U-SQL, an extension of T-SQL supporting C# for imperative programming. It is built on top of YARN (see my Hadoop ecosystem overview) but seems to be a Microsoft-only implementation, i.e. it isn’t Hive.
On top of that, we have Visual Studio tools that seem mostly to facilitate the authoring / debugging of analytics.
The two services are loosely coupled: the store implements HDFS and can therefore be queried by anything that understands HDFS. That means not only Azure HDInsight but also other Hadoop distributions (e.g. Cloudera), Spark and Azure Machine Learning.
Similarly, Azure Data Lake Analytics can query other stores, such as Hadoop, SQL, Azure SQL Data Warehouse, etc.
U-SQL is the query language of Azure Data Lake Analytics. In a nutshell, it merges the declarative power of T-SQL with the imperative power of C#.
Writing image analysis purely in T-SQL would be quite cumbersome. It is for that type of scenario that C# is supported.
U-SQL lends itself quite naturally to writing data transformation pipelines in code. This is very similar to what Hadoop Pig does.
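To give a flavor of that mix, here is a sketch based on the early U-SQL examples Microsoft has shown (the file paths and column names are hypothetical, and the syntax could still change before the preview): a script extracts a rowset from a file, transforms it with inline C# expressions, and writes the result out.

```sql
// U-SQL rowset variables (@x) chain steps into a pipeline,
// while expressions use C# syntax and semantics.
@searchlog =
    EXTRACT UserId int,
            Start  DateTime,
            Region string,
            Query  string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@result =
    SELECT Region,
           Query.ToLower() AS NormalizedQuery  // plain C# method call
    FROM @searchlog
    WHERE Region == "en-us";                   // C# equality operator

OUTPUT @result
TO "/output/result.csv"
USING Outputters.Csv();
```

Each rowset assignment is one step of the pipeline, which is what makes it feel like Pig, while the individual statements keep SQL's declarative shape.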
For now I see U-SQL as replacing both Hive & Pig in one stroke. On top of that, for developers used to .NET & T-SQL, it is very natural and productive. This is in stark contrast to approaching Hive, which is kind of SQL but where you need to learn a bunch of minimally documented crutches (for instance, invoking a generalized CSV parser requires you to reference a Java JAR-packaged component in the middle of the schema definition), or Pig, which is its own thing entirely.
Ok, so, why would you use Azure Data Lake vs Hadoop with Hive & Pig or Spark?
First, as I just mentioned, you’ll be productive way faster with Azure Data Lake. Not just because you already know T-SQL & C#, but because the integration with Visual Studio will beat Hadoop's odd tools any day.
Second, Azure Data Lake offers a killer feature for me: pay-per-query. This means you really pay only for what you use, instead of standing up a Hadoop cluster every time you need to perform queries and dropping it once you’re done. Again, this is more productive, since you don’t have to worry about cluster management, but it can also be way more economical.
Also, hopefully Data Lake Analytics is faster than Hive! If we’re lucky, it could be comparable to Spark.
A very exciting service! I think its main strength is the seamless integration of the different pieces rather than any drastically new concepts.
Doing Big Data today is quite a chore. You have a bunch of different tools to learn, use and string together. Azure Data Lake could drastically change that.
As usual, a big part will be the pricing of those services. If they are affordable they could help democratize Big Data Analytics.
Before the public preview arrives, have a look at the Service page.