UPDATE (19-01-2016): Have a look at the Azure Data Lake series for more posts on Azure Data Lake.
I wanted to talk a bit about your Strategic Data & the concept of a Data Lake (regardless of its implementation).
Nowadays, data is seen less and less as a commodity, a byproduct of running systems. More & more it is seen as an asset. For some tech giants, data is their lifeblood, their treasure chest.
This is the essence of Infonomics: assigning economic value to information. The 7 principles of Infonomics are:
- Information is an asset
- Information has both potential and realized value
- Information’s value can be quantified
- Information should be accounted for as an asset
- Information’s realized value should be maximized
- Information’s value should be used for prioritizing and budgeting IT and business initiatives
- Information should be managed as an asset
Now Infonomics is an emerging discipline and stands in stark contrast with today’s reality in most Enterprises.
Today’s data strategy
Today, Enterprise data is centered around the systems managing it (e.g. ERP, CRM, payroll, corporate CMS, etc.) and is therefore siloed within those systems. The data is captured, stored & analyzed within those systems.
This model leverages the strength of each Enterprise system in managing the data it produces, with its intimate knowledge of that data. Of course, the major weakness of that model is that data exists in silos, which produces various problems ranging from irritants to strategic problems. To name just a few:
- Double entry: the necessity for a user to capture data in two systems manually, e.g. customer information in the CRM & the ERP
- Duplication / poor management of data: Enterprise systems tend to manage their core data well and their satellite data poorly. For instance, that customer information you entered in the ERP might be duplicated for each order and some data might be concatenated into a ‘free form’ text field.
- Difficulty reconciling data between systems: each system is likely going to use its own identification mechanism and it might be hard to reconcile two customer profiles held in two different CRMs, e.g. a bank account & an insurance account.
Different strategies can be put in place to mitigate those issues.
A common one is to integrate different systems together, i.e. passing data from one system to another via some interfaces, either real-time, near real-time or in offline batches. This approach lets each system keep its own view of the world and the illusion that it is the center of your world.
Another strategy is to integrate data outside those systems, either in a Data warehouse or in a Master Data Management (MDM) system. This approach recognizes that no system has a complete view of your strategic data and creates this view outside those systems. It has the main advantage of freeing you from some of the constraints of each system to get a general view of the data.
Now what those data integration avenues have in common is:
- Most of the data life cycle (e.g. historisation, retention) is still managed by your Enterprise systems
- They are very expensive and rigid in nature, often spawning months-long projects to put in place
Now this was all nice and fine until a couple of years ago, when the internet broke loose and, combined with the constant drop in storage costs, created this new world of Big Data.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision-making, and process automation.
Today, your systems are probably spewing data. As you engage your customers more and more on the web and in the mobile world, you get a vast amount of telemetry & other valuable data. You want to reference partner data related to your data, for instance, research data on your customers’ demographics. More & more you find perceived value in information you were quickly discarding yesterday (e.g. logs).
This is where the concept of Data Lake becomes interesting.
A Data Lake isn’t a big data Data Warehouse. It’s a large storage repository of raw data. The term comes from the comparison of bottled water (structured, cleansed & managed data) with… a lake (your raw, unprocessed & un-cleansed data).
Because it is raw data, it doesn’t have all the project cost related to data warehousing or MDM projects. It also doesn’t have the benefits.
The idea is that you migrate all your strategic data into a Data Lake, that is, you copy it over. Once there, you can run analytics on it, first in discovery mode (small teams, data science) and later, once you have found value, in a more structured approach.
The major advantages of a Data Lake are:
- You store ALL your data, it is your treasure chest
- The economic model is that you invest in processing the data when you choose to dig in and analyse it, as opposed to upfront; this buys you agility: the ability to trial different data science avenues, fail fast on analytics and mine all your data
The idea of keeping ALL your data in one spot to mine it later might seem like a cosmetic change to your Enterprise Architecture, but it is in fact a HUGE enabler!
Logs are a good example. In traditional approaches, logs would be mined by querying system logs, extracting some information from them (e.g. aggregate visits) and storing the results in a data warehouse. Then your business changes and you need to go back to the original logs to extract more information? Well, the logs from 3 months ago are gone, and getting at the ones you still have would require re-opening the data movement orchestration that took months to develop by a team of consultants who have since left the building. Are you sure you need those? Can you justify it?
With a Data Lake, you kept ALL your raw data, all your logs. You can go back and mine information you didn’t use at first to see if the value you suspect is there actually is.
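To make the log scenario concrete, here is a minimal sketch in Python. The log format, the sample lines and both function names are hypothetical and only serve the illustration; it is not tied to any particular product. The point is that a question nobody thought of at first can still be answered, because the raw lines were kept rather than only the first aggregate.

```python
from collections import Counter

# Hypothetical raw access-log lines, kept exactly as they were produced.
# Assumed format: date time method path status device
RAW_LOGS = [
    "2016-01-04 10:01:07 GET /home 200 mobile",
    "2016-01-04 10:02:11 GET /pricing 200 desktop",
    "2016-01-05 09:15:42 GET /home 200 desktop",
    "2016-01-05 09:16:03 GET /signup 500 mobile",
]


def visits_per_day(lines):
    """The question asked on day one: how many visits per day?"""
    return Counter(line.split()[0] for line in lines)


def errors_per_device(lines):
    """A question that only came up later: which devices hit server errors?"""
    counts = Counter()
    for line in lines:
        _date, _time, _method, _path, status, device = line.split()
        if status.startswith("5"):
            counts[device] += 1
    return counts


# Both questions are answered from the same untouched raw lines.
print(dict(visits_per_day(RAW_LOGS)))
print(dict(errors_per_device(RAW_LOGS)))
```

Had only the output of `visits_per_day` been stored, the device and status columns would be gone and the second question could not be answered.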
You can gradually refine your data, e.g. using checkpoints, the same way ore is refined in an industrial process. But you can always go back to an unrefined stage and bring more material to the next level.
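The refinement idea could be sketched as follows, again in Python and under loud assumptions: the stage names (raw, cleansed, curated), the file names, the CSV layout and every function here are hypothetical, and a local temporary folder stands in for the lake. Each stage is written out as its own checkpoint, and the raw stage is never modified.

```python
import json
import tempfile
from pathlib import Path


def write_lines(path, lines):
    """Write one checkpoint file, creating its stage folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_lines(path):
    return path.read_text(encoding="utf-8").splitlines()


def refine_to_cleansed(raw_lines):
    """Stage 1: skip malformed rows and normalise the rest into JSON records."""
    records = []
    for line in raw_lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skipped at this stage, but still present in raw
        customer, country, amount = parts
        try:
            value = float(amount)
        except ValueError:
            continue
        records.append({"customer": customer, "country": country.upper(), "amount": value})
    return [json.dumps(r, sort_keys=True) for r in records]


def refine_to_curated(cleansed_lines):
    """Stage 2: aggregate the cleansed records into totals per country."""
    totals = {}
    for line in cleansed_lines:
        record = json.loads(line)
        totals[record["country"]] = totals.get(record["country"], 0.0) + record["amount"]
    return [json.dumps({"country": c, "total": t}, sort_keys=True)
            for c, t in sorted(totals.items())]


def run_pipeline(lake_root, raw_lines):
    """Land the raw data untouched, then write each refinement as a checkpoint."""
    raw = lake_root / "raw" / "orders.csv"
    cleansed = lake_root / "cleansed" / "orders.jsonl"
    curated = lake_root / "curated" / "totals_by_country.jsonl"
    write_lines(raw, raw_lines)
    write_lines(cleansed, refine_to_cleansed(read_lines(raw)))
    write_lines(curated, refine_to_curated(read_lines(cleansed)))
    return raw, cleansed, curated


# Hypothetical sample data, including one malformed row.
SAMPLE = ["alice, ca, 10.5", "bob, us, 4", "broken row", "carol, CA, 2.5"]

with tempfile.TemporaryDirectory() as tmp:
    raw, cleansed, curated = run_pipeline(Path(tmp), SAMPLE)
    print(read_lines(curated))
    print(len(read_lines(raw)), "raw rows kept,", len(read_lines(cleansed)), "cleansed")
```

Because the malformed row is still sitting in the raw stage, a later, smarter version of `refine_to_cleansed` could be re-run over it without asking the source system for anything.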
In many ways, those aren’t new concepts. Typical Data warehouse systems will have staging environments where you’ll keep intermediary representations of your data. The key difference is that you would typically flush lots of data in the name of server constraints. In a Data Lake mindset, you would typically take a Big Data approach where those constraints do not exist.
Data Lake in Azure
In Azure, you could use the new Data Lake Store to store both unstructured data (i.e. files) and structured data (i.e. tables). This allows you to store your raw data and then gradually refine it in a more structured way without worrying about the size of your data all along. You could use standard blob storage if the size of your data is below 500 TB. You could use the new Data Lake Analytics to process that data at scale, and Azure Data Factory to orchestrate that processing and the movement of your data to more operational stores, such as Azure SQL Data Warehouse or Azure SQL Database.
But whatever the actual technical solution you use, the concept of a Data Lake where you keep ALL your data for further processing can help you realize Infonomics on your Strategic Data within your Enterprise.