Tag Archives: Big Data

Any solution where traditional Data Architecture doesn’t work:  Azure Data Lake, HDInsight, Hadoop on Azure & Stream Analytics.

Querying tables in Azure Data Lake Analytics

Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this or have a look at Azure Data Lake series for more posts on Azure Data Lake.

I thought I would kick off some posts about more complex scenarios to show what’s possible with that technology.

In my last post, we imported data from the Social characteristics of the Marvel Universe data set into ADLA tables.  In this post, I will query those tables to get some insights out of them.

Data Model

The data model in the ADLA tables is the following:

image
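To recap the previous post, the model consists of three tables (column names & types as defined in the import script later on this page):

Input.Character (CharacterID int, CharacterName string)
Input.Book (BookID int, BookName string)
Input.CharacterBook (CharacterID int, BookID int)

Input.CharacterBook relates characters to the books they appear in.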

Popular Character

Obvious question:  which character is the most popular, i.e. which character has been published the most?

Let’s ask the question:

USE DATABASE Marvel;
 
//  Most popular characters
@characters =
    SELECT c.CharacterID,
           c.CharacterName,
           COUNT(cb.BookID) AS PublicationCount
    FROM Input.Character AS c
         INNER JOIN
             Input.CharacterBook AS cb
         ON cb.CharacterID == c.CharacterID
    GROUP BY c.CharacterID,
             c.CharacterName;
 
@charactersWithCount =
    SELECT *
    FROM @characters
    ORDER BY PublicationCount DESC
    FETCH FIRST 15 ROWS ONLY;
 
OUTPUT @charactersWithCount
TO "/Outputs/TopCharacters.tsv"
USING Outputters.Tsv();
 

We get the predictable result:

  • SPIDER-MAN/PETER PARKER
  • CAPTAIN AMERICA
  • IRON MAN/TONY STARK
  • THING/BENJAMIN J. GRIMM
  • THOR/DR. DONALD BLAKE/SIGURD JARLSON II/JAKE OLSON/LOREN OLSON
  • HUMAN TORCH/JOHNNY STORM
  • MR. FANTASTIC/REED RICHARDS
  • HULK/DR. ROBERT BRUCE BANNER
  • WOLVERINE/LOGAN
  • INVISIBLE WOMAN/SUE STORM RICHARDS
  • BEAST/HENRY &HANK& P. MCCOY
  • SCARLET WITCH/WANDA MAXIMOFF
  • DR. STRANGE/STEPHEN STRANGE
  • DAREDEVIL/MATT MURDOCK
  • WATSON-PARKER, MARY JANE

Popular DUOs

Another obvious question is:  which 2 characters are published together the most?

USE DATABASE Marvel;

//  Most popular duo of characters
@duoCount =
    SELECT COUNT(cb1.BookID) AS PublicationCount,
           cb1.CharacterID AS CharacterID1,
           cb2.CharacterID AS CharacterID2
    FROM Input.CharacterBook AS cb1
         INNER JOIN
             Input.CharacterBook AS cb2
         ON cb1.BookID == cb2.BookID
         // Eliminate duos consisting of the same character
    WHERE cb1.CharacterID != cb2.CharacterID
         // Making sure the same duo will be there only once
         AND cb1.CharacterID < cb2.CharacterID
    GROUP BY cb1.CharacterID,
             cb2.CharacterID;

@sortedDuos =
    SELECT dc.CharacterID1,
           c1.CharacterName AS CharacterName1,
           dc.CharacterID2,
           c2.CharacterName AS CharacterName2,
           dc.PublicationCount
    FROM @duoCount AS dc
         INNER JOIN
             Input.Character AS c1
         ON c1.CharacterID == dc.CharacterID1
         INNER JOIN
             Input.Character AS c2
         ON c2.CharacterID == dc.CharacterID2
    ORDER BY PublicationCount DESC
    FETCH FIRST 15 ROWS ONLY;

OUTPUT @sortedDuos
TO "/Outputs/TopDuos.tsv"
USING Outputters.Tsv();


Again, if you know the Marvel Universe, the results make sense:

  • HUMAN TORCH/JOHNNY STORM & THING/BENJAMIN J. GRIMM
  • HUMAN TORCH/JOHNNY STORM & MR. FANTASTIC/REED RICHARDS
  • MR. FANTASTIC/REED RICHARDS & THING/BENJAMIN J. GRIMM
  • INVISIBLE WOMAN/SUE STORM RICHARDS & MR. FANTASTIC/REED RICHARDS
  • HUMAN TORCH/JOHNNY STORM & INVISIBLE WOMAN/SUE STORM RICHARDS
  • INVISIBLE WOMAN/SUE STORM RICHARDS & THING/BENJAMIN J. GRIMM
  • SPIDER-MAN/PETER PARKER & WATSON-PARKER, MARY JANE
  • JAMESON, J. JONAH & SPIDER-MAN/PETER PARKER
  • CAPTAIN AMERICA & IRON MAN/TONY STARK
  • SCARLET WITCH/WANDA MAXIMOFF & VISION
  • ANT-MAN/DR. HENRY J. PYM & WASP/JANET VAN DYNE PYM
  • CYCLOPS/SCOTT SUMMERS & MARVEL GIRL/JEAN GREY SUMMERS
  • STORM/ORORO MUNROE & WOLVERINE/LOGAN
  • CAPTAIN AMERICA & THOR/DR. DONALD BLAKE/SIGURD JARLSON II/JAKE OLSON/LOREN OLSON
  • CAPTAIN AMERICA & VISION

Conclusion

We’ve seen how to do simple analytics using USQL on ADLA tables.

The data set I’m using is relatively small but with a Big Data set, the power of Hadoop opens lots of possibilities.

You can explore your data until you find some queries that make sense.  This means you can very easily explore big data sets without provisioning servers or even VMs.

Network Access Control on an HDInsight Cluster

In this post, I’m going to show how you can use an Azure Virtual Network with a Network Security Group to control access (at the network level) to an HDInsight cluster.

For a primer on both those technologies, please refer to my Using Network Security Groups (NSG) to secure network access to an environment post.

The main caveat I would add is that in that post I was using a Resource Manager Virtual Network (also known as v2) while HDInsight, at the time of this writing (late January 2016), supports only v1, or classic, Virtual Networks.  Everything else is the same, except I won’t be able to capture it in an ARM template at the end.

UPDATE (05-02-2016):  A Classic (v1) Virtual Network is required for a Windows cluster, which is what I do in this article.  For a Linux cluster, a Resource Manager (v2) Virtual Network is required.  This article remains valid; you only have to create your Virtual Network with the “Resource Manager” option.  See this documentation (which has been updated since I wrote this post) for more details.

The problem

Azure HDInsight is basically a Hadoop cluster as a service.  You can stand up a cluster in minutes and get on with your Big Data jobs.  Better, you can externalize the data so that you can destroy the cluster and stand it up days later and continue with the same data (see my post about how to setup HDInsight to externalize data).

Now HDInsight is public in nature, so all its endpoints are open on the internet.  Each endpoint uses authentication, but for some customers that is not enough.

That is understandable since most companies use Hadoop to analyse business data, sometimes sensitive data.  Therefore having more control over access is often mandated.

Virtual Network

The obvious way to get more control over network access is to start by attaching a Virtual Network to your HDInsight cluster.

Let’s first create a Virtual Network.  One of the critical steps is to choose the “Classic” model as opposed to the “Resource Manager” model.

image

Give your v-net a name.  You can leave the address space as the default, i.e. 10.0.0.0/16.  Put it in the same resource group as your HDInsight cluster (for convenience) and in the same region (that’s mandatory).  Then create it.

You now have a Virtual Network you can put an HDInsight cluster in.

Attaching Virtual Network to cluster

Now this has to be done at the cluster’s creation time:  you can’t add a Virtual Network afterwards (or change its subnet).

So when you create your cluster in the Azure portal, go to the Optional Configuration at the bottom.

image

Then select the Virtual Network box

image

Choose the Virtual Network you just created.  For the subnet, choose the default subnet.

When you create your cluster, it will behave identically to a cluster without a Virtual Network.  The only way to know there is a Virtual Network attached to it is to look at your cluster’s settings

image

and then look at its properties

image

At the bottom of the blade you should see the Virtual Network GUID.

image

A GUID, salt of the Earth!

VPN Gateway

Now that you have a Virtual Network, you can connect it to your on-premises network and block internet access altogether.

This is probably your most secure option because in many ways it is like the cluster is running on-premises, with the benefits of the cloud.

I won’t cover VPN Gateway in this post.

Network Security Group

Now that the cluster is in a Virtual Network, we can control access via a Network Security Group.

First, we’ll create a Network Security Group (NSG).

image

which should lead you to

image

Again, make sure you select “Classic” deployment model.

Give it a name and make sure it is in the same resource group and region as your cluster.

Next we’ll attach the newly created NSG to the default subnet of the Virtual Network.  NSGs are independent entities and can be attached to multiple subnets (on potentially multiple Virtual Networks).

Let’s open the virtual network and then open its settings.  From there, open its subnets and select the default one (or whichever you put your cluster in).

image

From there, select Network security group and then select the NSG you just created.

This binds the subnet to the NSG.

Configuring NSG rules

At this point your cluster shouldn’t be accessible.  This is because by default, NSGs disable most routes.  To see that, open your NSG and then, in the settings, open the Inbound security rules.

image

You should have no rules in there.  Click Default Rules at the top and that should display the default rules.  Basically, the NSG allows connections from within the virtual network and connections from the Azure Load Balancer (this is required so that VMs can be monitored internally), and denies every other route.

This means, among other things, that the RDP route (or SSH for a Linux cluster) is denied from the internet.

Similarly, if you look at the outbound rules:

image

Here, routes toward the virtual network & the internet are allowed but nothing else.

So this is very secure, as nothing can get in!  Actually, that would be perfect for the scenario where you connect the Virtual Network to your on-premises network (via an Azure VPN Gateway), since connections from your network would then get in.

Now, let’s add a rule to allow traffic coming from your laptop.

First, let’s determine what the IP of your laptop is, by using, for instance, https://www.whatismyip.com/.

Then, let’s go back to the inbound rules of your NSG and let’s add a rule:

  • Name:  Allow laptop
  • Priority:  500 (anything between 100 and 64999 really)
  • Source:  CIDR block
  • Source IP address range:  <the IP of your laptop>/32 (the /32 makes the IP you specify the only enabled IP)
  • Protocol:  TCP
  • Source Port Range:  *
  • Destination:  Any
  • Destination Port Range:  *

Once the rule has been saved and the NSG updated (it usually takes less than a minute), you should be able to access your cluster (e.g. RDP / SSH, HTTPS to dashboard, etc.).

In practice you would specify a larger IP range corresponding to the outbound IPs of your organization (or department).

Now this will open all the ports of your cluster to the specified IP range.  You might be tempted to enable ports one by one, but some port mapping (for instance for RDP) happens before traffic hits the Virtual Network, which prevents that approach from being effective.

Conclusion

We’ve seen how to lock down an HDInsight cluster:

  • Create a Virtual Network
  • Associate the Virtual Network to your HDInsight cluster at creation time
  • Create a Network Security Group (NSG)
  • Associate the NSG to your cluster subnet
  • Add an access rule to the inbound rules of the NSG

If you tear down your cluster, you can keep the virtual network & associated NSG around.  This way, next time you stand up your cluster, you can simply associate the virtual network and get all the network rules back.

Analyzing Web Logs with Azure Data Lake Analytics (ADLA)

Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this or have a look at Azure Data Lake series for more posts on Azure Data Lake.

I thought I would kick off some posts about more complex scenarios to show what’s possible with that technology.

In this post I decided to analyse web logs at scale.  This complements the excellent documentation on how to read logs from one file; here we’ll read through multiple log files.

Here I’m going to read web server logs (IIS logs) from an Azure Web site.  I’ll assume you have a web site and have configured web server logging to blob storage.

Hooking on the blob storage

The first thing to do is to add a new data source to our ADLA account.  This will allow us to parse the files directly from blob storage.

In the main pane of ADLA, click “Add Data Source”.

image

In the “Add Data Source” pane, under “Storage Type”, select “Azure Storage”.

image

Enter the storage account information of the storage account you are using in your web site to store your web server logs.

When done, click the “Add” button at the bottom.

Skipping line problem

Currently skipping lines in an input file isn’t supported.  This is quite unfortunate because log files come with two header rows:

 
#Software: Microsoft Internet Information Services 8.0
#Fields: date time s-sitename cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken 

The obvious way to get around this would be to skip the first two lines.  You can see the feature is sort of there in the code documentation of Extractors, but you can’t actually use it at this point (early January 2016).

This is a temporary shortcoming of the preview and will likely be fixed soon.

Because of this, we’ll have to jump through hoops and do some preprocessing.

Preprocessing Script

Let’s create a U-SQL script.  First, let’s load all the logs:

@lines =
    EXTRACT Line string,
            FileName string,
            Year string,
            Month string,
            Day string,
            Hour string
    FROM "wasb://@.blob.core.windows.net//{Year:*}/{Month:*}/{Day:*}/{Hour:*}/{FileName:*}"
    // Hack in order not to have the extractor delimitate columns
    USING Extractors.Text(delimiter : '$');

First, notice that the file path starts with “wasb”.  This is because we aren’t targeting the default data source (the first one) of our ADLA account.  For secondary data sources, we need to specify the source using:

  • For ADLS:  adl://<Data LakeStorageAccountName>.azuredatalakestore.net:443/Samples/Data/SearchLog.tsv
  • For Azure Blob Storage:  wasb://<BlobContainerName>@<StorageAccountName>.blob.core.windows.net/Samples/Data/SearchLog.tsv

Instead of grabbing just one file, we grab a file set.  You can read about the rules of how to define a file set here.

Finally, we use my good old trick to get the entire line without parsing columns.  If we parsed columns, the extraction would break on the first two lines of each file, as mentioned previously.

Then, we’ll filter out those header lines (starting with ‘#’) and reconstruct the blob path for future reference:

@logLines =
    SELECT Line,
           "//" + Year + "/" + Month + "/" + Day + "/" + Hour + "/" + FileName AS BlobPath
    FROM @lines
    WHERE !Line.StartsWith("#");

In order to parse the lines, we’ll need some C# code:

using System;

namespace MarvelUsql
{
    public static class SpaceSplit
    {
        /// <summary>Returns the space-delimited column at the given index from a log line.</summary>
        public static string GetColumn(string line, int index)
        {
            var parts = line.Split(' ');

            if (index >= parts.Length)
            {
                throw new ArgumentOutOfRangeException(
                    "index", "Column " + index + " isn't available in line " + line);
            }
            else
            {
                return parts[index];
            }
        }
    }
}

Yes, I keep the old Marvel theme from the other posts.

We can put that code in the code-behind of the script we are building.  This simplifies the compilation and assembly registration process.  We can then leverage the custom code in our script:

@logs =
    SELECT DateTime.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 0)) AS s_date,
           MarvelUsql.SpaceSplit.GetColumn(Line, 1) AS s_time,
           MarvelUsql.SpaceSplit.GetColumn(Line, 2).ToUpper() AS s_sitename,
           MarvelUsql.SpaceSplit.GetColumn(Line, 3) AS cs_method,
           MarvelUsql.SpaceSplit.GetColumn(Line, 4) AS cs_uristem,
           MarvelUsql.SpaceSplit.GetColumn(Line, 5) AS cs_uriquery,
           int.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 6)) AS s_port,
           MarvelUsql.SpaceSplit.GetColumn(Line, 7) AS cs_username,
           MarvelUsql.SpaceSplit.GetColumn(Line, 8) AS c_ip,
           MarvelUsql.SpaceSplit.GetColumn(Line, 9) AS cs_useragent,
           MarvelUsql.SpaceSplit.GetColumn(Line, 10) AS cs_cookie,
           MarvelUsql.SpaceSplit.GetColumn(Line, 11) AS cs_referer,
           MarvelUsql.SpaceSplit.GetColumn(Line, 12) AS cs_host,
           int.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 13)) AS sc_status,
           int.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 14)) AS sc_substatus,
           int.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 15)) AS sc_win32status,
           int.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 16)) AS sc_bytes,
           int.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 17)) AS cs_bytes,
           int.Parse(MarvelUsql.SpaceSplit.GetColumn(Line, 18)) AS s_timetaken,
           BlobPath
    FROM @logLines;

Finally, we’ll output the result into a consolidated file:

OUTPUT @logs
TO "/Preprocess/Logs.log"
USING Outputters.Text(delimiter:' ');

This basically concludes the pre-processing of the logs.  We now have them all in one file.  The file might be huge but, thanks to ADLS having no storage limits, that is OK.

We could also have stored the logs in a table, which would have accelerated future processing.  See this post as an example of how to do that.
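As a sketch of that alternative (assuming a WebLogs database with an Input schema created beforehand; the names here are mine, not from the original scripts), the OUTPUT statement above could be replaced by a managed table, following the same pattern as in the post referenced above:

USE DATABASE WebLogs;

DROP TABLE IF EXISTS Input.Log;

//  Persist the parsed log rows into a managed ADLA table instead of a flat file
CREATE TABLE Input.Log
(
    INDEX LogIndex
    CLUSTERED(s_date ASC)
    PARTITIONED BY
    RANGE(s_date)
) AS
SELECT *
FROM @logs;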

Popular Pages

Now that we’ve pre-processed the logs, let’s run some analytics.

Let’s determine the most popular pages:

@logs =
    EXTRACT s_date DateTime,
            s_time string,
            s_sitename string,
            cs_method string,
            cs_uristem string,
            cs_uriquery string,
            s_port int,
            cs_username string,
            c_ip string,
            cs_useragent string,
            cs_cookie string,
            cs_referer string,
            cs_host string,
            sc_status int,
            sc_substatus int,
            sc_win32status int,
            sc_bytes int,
            cs_bytes int,
            s_timetaken int,
            BlobPath string
    FROM "/Preprocess/Logs.log"
    USING Extractors.Text(delimiter : ' ');
 
@popular =
    SELECT COUNT( * ) AS HitCount,
           s_sitename,
           cs_method,
           cs_uristem
    FROM @logs
    GROUP BY s_sitename,
             cs_method,
             cs_uristem
    ORDER BY HitCount DESC
    FETCH FIRST 10 ROWS ONLY;
 
OUTPUT @popular
TO "/Outputs/PopularPages.tsv"
USING Outputters.Tsv();

In the first query, we apply a schema on read to the consolidated logs we just created.

In the second query, we aggregate the requests per page, count them, sort by the count and keep the top 10.

In the third query, we output that result to a TSV file.
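As a small variation (a sketch, not part of the original script), you could reuse the same @logs row set to count only successful requests by filtering on the HTTP status before grouping:

@popularSuccess =
    SELECT COUNT( * ) AS HitCount,
           s_sitename,
           cs_method,
           cs_uristem
    FROM @logs
    WHERE sc_status == 200
    GROUP BY s_sitename,
             cs_method,
             cs_uristem
    ORDER BY HitCount DESC
    FETCH FIRST 10 ROWS ONLY;

OUTPUT @popularSuccess
TO "/Outputs/PopularSuccessfulPages.tsv"
USING Outputters.Tsv();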

Hits per day

Similarly, we can check the top days for traffic:

@logs =
    EXTRACT s_date DateTime,
            s_time string,
            s_sitename string,
            cs_method string,
            cs_uristem string,
            cs_uriquery string,
            s_port int,
            cs_username string,
            c_ip string,
            cs_useragent string,
            cs_cookie string,
            cs_referer string,
            cs_host string,
            sc_status int,
            sc_substatus int,
            sc_win32status int,
            sc_bytes int,
            cs_bytes int,
            s_timetaken int,
            BlobPath string
    FROM "/Preprocess/Logs.log"
    USING Extractors.Text(delimiter : ' ');
 
@perDay =
    SELECT COUNT( * ) AS HitCount,
           s_date
    FROM @logs
    GROUP BY s_date
    ORDER BY HitCount DESC
    FETCH FIRST 10 ROWS ONLY;
 
OUTPUT @perDay
TO "/Outputs/PagesPerDay.tsv"
USING Outputters.Tsv();

Conclusion

You can see how quickly we can ingest web server logs and do some simple analytics at scale.

The temporary shortcoming of not being able to schematise the files directly made the solution more complicated than it needed to be.  With the skip-line feature, we will be able to extract all the files in one query and run the aggregation (or whatever analysis) in a second query.

Strategic Data, Infonomics & Big Data

UPDATE (19-01-2016):  Have a look at Azure Data Lake series for more posts on Azure Data Lake.

(Image source:  http://www.realwire.com/)

I wanted to talk a bit about your Strategic Data & the concept of a Data Lake (regardless of its implementation).

Nowadays, data is seen less and less as a commodity, a byproduct of running systems.  More & more, it is seen as an asset.  For some tech giants, data is their lifeblood, their treasure chest.

This is the essence of Infonomics:  assigning economic value to information.  The 7 principles of Infonomics are:

  1. Information is an asset
  2. Information has both potential and realized value
  3. Information’s value can be quantified
  4. Information should be accounted for as an asset
  5. Information’s realized value should be maximized
  6. Information’s value should be used for prioritizing and budgeting IT and business initiatives
  7. Information should be managed as an asset

Now Infonomics is an emerging discipline and stands in stark contrast with today’s reality in most Enterprises.

Today’s data strategy

Today, Enterprise data is centered around the systems managing it (e.g. ERP, CRM, payroll, corporate CMS, etc.) and is therefore siloed within those systems.  The data is captured, stored & analyzed within those systems.

image

This model leverages the strength of each Enterprise system in managing the data it produces, with intimate knowledge of that data.  Of course, the major weakness of that model is that data exists in silos, which produces various problems ranging from irritants to strategic issues.  To name just a few:

  • Double entry:  the necessity for a user to capture data in two systems manually, e.g. customer information in both the CRM & the ERP
  • Duplication / poor management of data:  Enterprise Systems tend to manage their core data well and their satellite data poorly.  For instance, that customer information you entered in the ERP might be duplicated for each order and some data might be concatenated into a ‘free form’ text field.
  • Difficulty reconciling data between systems:  each system is likely to use its own identification mechanism and it might be hard to reconcile two customer profiles in two different CRMs, e.g. a bank account & an insurance account.

Different strategies can be put in place to mitigate those issues.

A common one is to integrate different systems together, i.e. passing data from one system to another via some interfaces, either in real-time, near real-time or in offline batches.  This approach lets each system keep its own view of the world and the illusion that it is the center of your world.

image

Another strategy is to integrate data outside those systems, either in a Data warehouse or in a Master Data Management (MDM) system.  This approach recognizes that no system has a complete view of your strategic data and creates this view outside those systems.  It has the main advantage of freeing you from some of the constraints of each system to get a general view of the data.

image

Now, what those data integration avenues have in common is:

  • Most of the data life cycle (e.g. historisation, retention) is still managed by your Enterprise systems
  • They are very expensive and rigid in nature, often spawning months-long projects to put in place

Now this was all nice and fine until a couple of years ago, when the internet broke loose and, with the constant drop in storage costs, created this new world of Big Data.

Big Data

Gartner defines Big Data as

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision-making, and process automation.

Today, your systems are probably spewing data.  As you engage your customers more and more on the web and in the mobile world, you get a vast amount of telemetry & other valuable data.  You want to reference partner data related to your data, for instance research data on your customers’ demographics.  More & more, you find value in information you were quickly discarding yesterday (e.g. logs).

This is where the concept of Data Lake becomes interesting.

Data Lake

A Data Lake isn’t a big data Data Warehouse.  It’s a large storage repository of raw data.  The term comes from the comparison of bottled water (structured, cleansed & managed data) with…  a lake (your raw, unprocessed & un-cleansed data).

Because it holds raw data, it doesn’t carry all the project costs related to data warehousing or MDM projects.  It also doesn’t have the benefits.

The idea is that you migrate all your strategic data into a Data Lake, that is, you copy it over.  Once it is there, you can run analytics on it, in discovery mode (small teams, data science) first and later, once you have found value, in a more structured approach.

image

The majors advantages of a Data Lake are:

  • You store ALL your data, it is your treasure chest
  • The economic model is that you invest in processing the data when you choose to dig in and analyse it, as opposed to upfront; this buys you agility:  the ability to trial different data science avenues, fail fast on analytics and mine all your data

The idea of keeping ALL your data in one spot to mine it later might seem like a cosmetic change to your Enterprise Architecture, but it is in fact a HUGE enabler!

A good example is logs.  In traditional approaches, logs would be mined by querying the system logs, extracting some information (e.g. aggregate visits) and storing the results in a data warehouse.  Comes a change in your business and you need to go back to the original logs to extract more information?  Well, the logs from 3 months ago are gone, and the ones you have would require re-opening the data movement orchestration that took months to develop by a team of consultants who have since left the building.  Are you sure you need those?  Can you justify it?

With a Data Lake, you kept ALL your raw data, all your logs.  You can go back and mine information you didn’t use at first to see if the value you suspect is there actually is.

You can gradually refine your data, e.g. using checkpoints, the same way ore is refined in an industrial process.  But you can always go back to an unrefined stage and bring more material to the next level.

In many ways, those aren’t new concepts.  Typical Data warehouse systems will have staging environments where you keep intermediary representations of your data.  The key difference is that you would typically flush lots of data in the name of server constraints.  In a Data Lake mindset, you take a Big Data approach where those constraints do not exist.

Data Lake in Azure

In Azure, you could use the new Data Lake Store to store both unstructured data (i.e. files) and structured data (i.e. tables).  This allows you to store your raw data and then gradually refine it in a more structured way without worrying about the size of your data along the way.  You could use standard blob storage if the size of your data is below 500 TB.  You could use the new Data Lake Analytics to process that data at scale, and Azure Data Factory to orchestrate that processing and the movement of your data to more operational stores, such as Azure SQL Data Warehouse or Azure SQL Database.

Conclusion

Whatever the actual technical solution you use, the concept of a Data Lake, where you keep ALL your data for further processing, can help you realize infonomics on your strategic data within your Enterprise.

Azure Data Lake Analytics – Loading files with custom C# code

UPDATE (19-01-2016):  Have a look at Azure Data Lake series for more posts on Azure Data Lake.

Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this.

I thought I would kick off some posts about more complex scenarios to show what’s possible with that technology.

In this post, I will load data from custom format files into Azure Data Lake Analytics (ADLA) tables.  This is the first step before running some analytics on the data.  It is the first checkpoint of the process.

When I say custom format files, I mean files that do not fall into the CSV / TSV / delimited file formats.  This will lead us to explore how to use C# to do part of the parsing.

I won’t lie to you:  the product has a few rough edges in its first version of Public Preview, so I hope this post helps you find your destination faster!

Visual Studio Projects

In order to do this post, I’ve created a solution in Visual Studio with a U-SQL Project & a Class Library (For U-SQL Application) project.  Here is the final file structure of the solution:

image

We’ll go through it in this post.

So the plan here is to persist the data into three ADLA tables, following the checkpoint pattern.

Data set

I looked for a big data set and struggled a bit.  I either found tiny data sets with the equivalent of one table or huge ones (multiple TB) which are a bit prohibitive for proof-of-concepts (POC) such as this post.

I settled for a fun one I found on Amazon AWS:  Social characteristics of the Marvel Universe.  It’s basically a data set about which Marvel characters appear in which comic books.

It is of non-trivial size but by no means a big data set.  So using Hadoop, a big compute engine, on it is a bit of an overkill, but it’s for a POC, isn’t it?

It has three parts:

  • vert1.txt:  the characters (vertices)
  • vert2.txt:  the comic books (vertices)
  • porgat.txt:  the relations between characters and the books they appear in

Those files have a Byte Order Mark (BOM) at the beginning to make things interesting.  Unfortunately, at this point, this blows up the USQL parsing.  The easiest way to remove the BOMs is to download the files locally, open them in Visual Studio, do a “Save As…”, then hit the arrow next to the Save button in the dialog box:

image

choose “Save with Encoding…”, choose UTF-8 without signature and save them.

image

While we are at it, for the last file, i.e. porgat.txt, we’ll need to remove the first part of the file.  Search for “*Edgeslist” and delete everything before it.  The file should now start with “1 6487”.

Let’s call all of that pre-processing…

Let’s copy those three files into one Marvel folder under your ADLS account.

I won’t cover the very basics; please read Logistic / Get Started.

Create Database script

I want to put all the tables inside a database container so let’s create a database with two schemas:  one for the raw input files (as tables) and one for analytics (I’ll use it in a future post only).

In the CreateDbAndSchemas.usql file, let’s put the following content:

//  This drops the database (with its data) before recreating it
DROP DATABASE IF EXISTS Marvel;

//  Create a database
CREATE DATABASE Marvel;

//  Use the new database as the current
USE DATABASE Marvel;

//  Create two schemas
CREATE SCHEMA Input;
CREATE SCHEMA Analytic;

Remember that USQL keywords must be uppercase in order not to conflict with C# (e.g. SELECT vs select in LINQ / C#).

You can submit that as a job and validate in the portal that the database is created.  You will not see the schemas there yet though.

Handling custom format:  function

I’ve tried different approaches before settling on this one.

Basically, the first two files share the same format while the last one has a different format.  I didn’t want to repeat the code for the first two files.

We could invoke some C# code twice, but I found using a USQL function, itself invoking C# methods, to be the most elegant approach.

I fully covered how to register assemblies in this post.  For this, since we are invoking C# code from within a function, we are going to use the Visual Studio Register Assembly option.  The C# code is going to reside in a separate assembly.

Vertex Format

Let’s look at the first format, which I called the vertex format.  Here’s the beginning of vert1.txt:

Vertex 1: 24-HOUR MAN/EMMANUEL
Vertex 2: 3-D MAN/CHARLES CHANDLER & HAROLD CHANDLER
Vertex 3: 4-D MAN/MERCURIO
Vertex 4: 8-BALL/

The format is pretty obvious but it also doesn’t comply with any format supported by USQL extractors, i.e. it isn’t comma or tab separated…  This is why we’re going to use C# code to parse it.  The most elegant way is to use a regular expression to parse and extract the data.  Here is the C# helper class to accomplish that:

using System.Text.RegularExpressions;

namespace MarvelLib
{
  /// <summary>Helper class containing methods parsing a vertex format file.</summary>
  public static class VertexFormatHelper
  {
    private static readonly Regex _regex = new Regex(@"^\s*Vertex\s*(?<id>\d*)\s*:\s*(?<label>.*)\s*$");

    /// <summary>Predicate returning <c>true</c> iif a line of text is in the vertex format.</summary>
    /// <param name="line">Input line of text.</param>
    /// <returns></returns>
    public static bool IsVertexMatch(string line)
    {
      return _regex.IsMatch(line);
    }

    /// <summary>Get the ID from a vertex line.</summary>
    /// <param name="line">Input line of text.</param>
    /// <returns>ID of the vertex</returns>
    public static int GetID(string line)
    {
      var idText = _regex.Match(line).Groups["id"].Value;

      return int.Parse(idText);
    }

    /// <summary>Get the label from a vertex line.</summary>
    /// <param name="line">Input line of text.</param>
    /// <returns>Label of the vertex.</returns>
    public static string GetLabel(string line)
    {
      var label = _regex.Match(line).Groups["label"].Value;

      return label;
    }
  }
}

The Function consuming this is the following:

CREATE FUNCTION Input.GetVertexData(@fileSet string)
RETURNS @vertexData TABLE
(
  VertexID int,
  VertexLabel string
)
AS BEGIN
REFERENCE ASSEMBLY MarvelLib;

//  Extract the data with no schema, i.e. whole line
@lines =
  EXTRACT Line string
  FROM @fileSet
  // Hack in order not to have the extractor delimitate columns
  USING Extractors.Text(delimiter : '$');

//  Give the query a schema on read
@vertexData =
  //  Extract the vertex id & label from the line
  SELECT MarvelLib.VertexFormatHelper.GetID(Line) AS VertexID,
  MarvelLib.VertexFormatHelper.GetLabel(Line) AS VertexLabel
  FROM @lines
  WHERE MarvelLib.VertexFormatHelper.IsVertexMatch(Line);

RETURN;
END;

The function takes a string parameter representing a file set.  A file set in U-SQL is either a file path or an expression representing multiple files (typically with ‘*’).

Notice the reference assembly operation within the body of the function.  This is necessary to use the C# method defined in that assembly.

The first expression parses the text file into rows.  We use a little hack to read an entire row (line) at a time since U-SQL (at this point in time) doesn’t directly support loading one line at a time.

In the second expression we use the C# method to extract the ID and the label in each line.

The second expression is what the function returns.  We’ll see below how we further transform that expression.
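As a quick usage example (mirroring what the import script does further down), the function is invoked with a file path and its result is treated like any other row set:

//  Load a row set through the function and project it into a specific schema
@characterVertices = Input.GetVertexData("/Marvel/vert1.txt");

@characters =
    SELECT VertexID AS CharacterID,
           VertexLabel AS CharacterName
    FROM @characterVertices;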

Relation Format

Let’s look at the second format, which I called the relation format.  Here’s the beginning of porgat.txt:

1 6487
2 6488 6489 6490 6491 6492 6493 6494 6495 6496
3 6497 6498 6499 6500 6501 6502 6503 6504 6505
4 6506 6507 6508

Again the format is pretty straightforward but isn’t directly supported by U-SQL Extractors.

We’re going to use the following C# class (simply splitting the line on spaces) to help us out:

using Microsoft.Analytics.Types.Sql;
using System.Linq;

namespace MarvelLib
{
  public static class RelationFormatHelper
  {
    public static bool IsRelationMatch(string line)
    {
      var parts = line.Split(' ');

      return parts.Length != 0;
    }

    public static int GetID(string line)
    {
      var parts = line.Split(' ');

      return int.Parse(parts[0].Trim());
    }

    public static SqlArray<int> GetRelations(string line)
    {
      var parts = line.Split(' ');
      var relations = from p in parts.Skip(1)
                      select int.Parse(p);

      return SqlArray.Create(relations);
    }
  }
}

The function consuming this code is:

CREATE FUNCTION Input.GetRelationData(@fileSet string)
RETURNS @relationData TABLE
(
  MainID int,
  RelationID int
)
AS BEGIN
REFERENCE ASSEMBLY MarvelLib;

//  Extract the data with no schema, i.e. whole line
@lines =
  EXTRACT Line string
  FROM @fileSet
  // Hack in order not to have the extractor delimitate columns
  USING Extractors.Text(delimiter : '$');

//  Give the query a schema on read
@flatRelations =
  //  Extract the main id & the relations (as an array) from the line
  SELECT MarvelLib.RelationFormatHelper.GetID(Line) AS MainID,
        //   This is a SQL.Array<int>, a special U-SQL type
        MarvelLib.RelationFormatHelper.GetRelations(Line) AS RelationIDs
  FROM @lines
  WHERE MarvelLib.RelationFormatHelper.IsRelationMatch(Line);

//  Here we're gona unpack the relation IDs array on to many rows
@relationData =
  SELECT MainID,
          rid AS RelationID
  FROM @flatRelations
  CROSS APPLY EXPLODE (RelationIDs) AS r(rid);

RETURN;
END;

In the first expression we parse the lines of the file.

In the second expression, we use the C# methods to extract the main ID and the list of relation IDs as an SQL Array.

In the third expression, we explode the array of relation IDs on different rows.  I covered the CROSS APPLY EXPLODE in the following post.

The complete content of CreateFormatFunctions.usql is the following:

USE DATABASE Marvel;

DROP FUNCTION IF EXISTS Input.GetVertexData;
DROP FUNCTION IF EXISTS Input.GetRelationData;

CREATE FUNCTION Input.GetVertexData(@fileSet string)
RETURNS @vertexData TABLE
(
  VertexID int,
  VertexLabel string
)
AS BEGIN
REFERENCE ASSEMBLY MarvelLib;

//  Extract the data with no schema, i.e. whole line
@lines =
  EXTRACT Line string
  FROM @fileSet
  // Hack in order not to have the extractor delimitate columns
  USING Extractors.Text(delimiter : '$');

//  Give the query a schema on read
@vertexData =
  //  Extract the vertex id & label from the line
  SELECT MarvelLib.VertexFormatHelper.GetID(Line) AS VertexID,
  MarvelLib.VertexFormatHelper.GetLabel(Line) AS VertexLabel
  FROM @lines
  WHERE MarvelLib.VertexFormatHelper.IsVertexMatch(Line);

RETURN;
END;

CREATE FUNCTION Input.GetRelationData(@fileSet string)
RETURNS @relationData TABLE
(
  MainID int,
  RelationID int
)
AS BEGIN
REFERENCE ASSEMBLY MarvelLib;

//  Extract the data with no schema, i.e. whole line
@lines =
  EXTRACT Line string
  FROM @fileSet
  // Hack in order not to have the extractor delimitate columns
  USING Extractors.Text(delimiter : '$');

//  Give the query a schema on read
@flatRelations =
  //  Extract the main id & the relations (as an array) from the line
  SELECT MarvelLib.RelationFormatHelper.GetID(Line) AS MainID,
  //   This is a SQL.Array<int>, a special U-SQL type
  MarvelLib.RelationFormatHelper.GetRelations(Line) AS RelationIDs
  FROM @lines
  WHERE MarvelLib.RelationFormatHelper.IsRelationMatch(Line);

//  Here we're gona unpack the relation IDs array on to many rows
@relationData =
  SELECT MainID,
         rid AS RelationID
  FROM @flatRelations
  CROSS APPLY EXPLODE (RelationIDs) AS r(rid);

RETURN;
END;

You can submit the script as a job.  This will create the two functions.

Bringing it all together:  importing the data

We’ve built all the tools to easily import the files into ADLA tables.

In the file ImportData.usql:

USE DATABASE Marvel;

//  Load row sets
@characterVertices = Input.GetVertexData
                     (
                         "/Marvel/vert1.txt"
                     );
@bookVertices = Input.GetVertexData
                (
                    "/Marvel/vert2.txt"
                );
@characterBookIDs = Input.GetRelationData
                    (
                        "/Marvel/porgat.txt"
                    );

//  Schema on read:  project the generic schema into a specific one for characters
@characters =
    SELECT VertexID AS CharacterID,
           VertexLabel AS CharacterName
    FROM @characterVertices;
//  Schema on read:  project the generic schema into a specific one for books
@books =
    SELECT VertexID AS BookID,
           VertexLabel AS BookName
    FROM @bookVertices;
//  Schema on read:  project the generic schema into a specific one for books-characters
@characterBooks =
    SELECT MainID AS CharacterID,
           RelationID AS BookID
    FROM @characterBookIDs;

//  Drop the tables before recreating them to load data
DROP TABLE IF EXISTS Input.Character;
DROP TABLE IF EXISTS Input.Book;
DROP TABLE IF EXISTS Input.CharacterBook;

//  Create the character table as a managed table with a query
CREATE TABLE Input.Character
(
    INDEX CharacterIndex
    CLUSTERED(CharacterID ASC)
    PARTITIONED BY
    RANGE(CharacterID)
) AS
SELECT *
FROM @characters;

//  Create the book table as a managed table with a query
CREATE TABLE Input.Book
(
    INDEX BookIndex
    CLUSTERED(BookID ASC)
    PARTITIONED BY
    RANGE(BookID)
) AS
SELECT *
FROM @books;

//  Create the character-book relation table as a managed table with a query
CREATE TABLE Input.CharacterBook
(
    INDEX CharacterBookIndex
    CLUSTERED(CharacterID, BookID ASC)
    PARTITIONED BY
    RANGE(CharacterID, BookID)
) AS
SELECT *
FROM @characterBooks;

The script is pretty straightforward.  The patterns used are:

  • Schema on read:  we parse the files and define the schema while reading them
  • Checkpoint:  we save the data into a table before doing further analytics on the data (out of scope for this post)
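Once the job completes, a quick sanity check (not part of the original scripts, names are mine) is to count the rows loaded into one of the tables:

USE DATABASE Marvel;

//  Count the rows loaded in the character-book relation table
@characterBookCount =
    SELECT COUNT( * ) AS CharacterBookCount
    FROM Input.CharacterBook;

OUTPUT @characterBookCount
TO "/Outputs/CharacterBookCount.tsv"
USING Outputters.Tsv();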

Conclusion

We covered a lot of material and I hope it helps you author your own scripts.

We demonstrated the power of C# code to parse files with non-trivial formats.  The result is a row set we can further manipulate with projections and other operations.

We also persisted the data into indexed & partitioned ADLA tables.  This is useful because we’ll be able to run analytics on the data (in future posts) without processing it from scratch.

U-SQL Cross Apply

UPDATE (19-01-2016):  Have a look at Azure Data Lake series for more posts on Azure Data Lake.

Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this.

I thought I would kick off some posts about more complex scenarios to show what’s possible with that technology.

In this post I’ll write about a specific U-SQL operation since its syntax is a bit tricky.

The Problem

U-SQL supports arrays in the form of the .NET type SQL.Array<T>.  This is very convenient in many scenarios, especially when you are parsing inputs.

We often want to take an array and explode it onto different rows.  This is useful not only with arrays but also with SQL.MAP<T,V> or other complex types, e.g. XML or JSON.

This is accomplished by using the CROSS APPLY & EXPLODE operators.

The Solution

Let’s look at the solution.

@content =
	SELECT *
	FROM(
		VALUES
		(
			12,
			"a, b, c"
		),
		(
			75,
			"f, g, h, i, j"
		)) AS t([ID], LetterList);

@inArray =
	SELECT [ID],
		SqlArray.Create(LetterList.Split(',')) AS LetterArray
	FROM @content;

@exploded =
	SELECT [ID],
		letter.Trim() AS Letter
	FROM @inArray
	CROSS APPLY
	EXPLODE(LetterArray) AS r(letter);

OUTPUT @exploded
TO "/Outputs/explosion.tsv"
USING Outputters.Tsv();

In the first expression, I create a row set from literal values:

ID LetterList
12 “a, b, c”
75 “f, g, h, i, j”

In the second expression, I transform the second column into an array:

ID LetterList
12 {“a”, ” b”, ” c”}
75 {“f”, ” g”, ” h”, ” i”, ” j”}

In the third and final expression, I explode the arrays onto different rows (and trim the strings to remove the leading spaces left by the split):

ID Letter
12 “a”
12 “b”
12 “c”
75 “f”
75 “g”
75 “h”
75 “i”
75 “j”

Once you know it, the syntax is pretty simple, but it isn’t trivial to deduce from the documentation, hence this post.

Conclusion

Very useful for pivoting raw arrays into rows, CROSS APPLY with EXPLODE is another tool to help you parse semi-structured data into structured data (tables).

Registering assemblies in Azure Data Lake Analytics

UPDATE (19-01-2016):  Have a look at Azure Data Lake series for more posts on Azure Data Lake.

Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this.

I thought I would kick off some posts about more complex scenarios to show what’s possible with that technology.

In this post I’ll write about how to register assemblies in Azure Data Lake Analytics (ADLA).

This one took me quite a while to figure out, no thanks to the beta state of the tooling.

The problem

Let’s start with the problem.  Let’s say we need some custom C# code and want to share it among multiple USQL scripts.

I’m talking about “complex code”, not inline C# code you insert within a USQL script.  The following is inline C#:

SELECT
    Line.Split('|')[0] AS FirstToken
FROM @lines;

Here we simply call the string.Split method inline in a SELECT statement within a USQL script.  This is “complex code” called in USQL:

SELECT
    MyNamespace.MyClass.MyMethod(Line) AS Result
FROM @lines;

where, of course, MyNamespace.MyClass.MyMethod is defined somewhere.

Inline code works perfectly well and is well supported.  For complex code, you need to register assemblies and this is where the fun begins.

Now you’ll often need to go with complex code because inline code is a bit limited.  You can’t instantiate objects and hold references to them in U-SQL right now.  So inline code really is just that:  an inline method call.

I’ll show you different approaches available and tell you about their shortcomings.

Keep in mind that this is based on the public preview and that I’m writing these lines in early January 2016.  Very likely, many if not all of those shortcomings will disappear in future releases.

Code behind

The easiest way to use complex code is the code-behind of a script.

image

This should look familiar to you if you’ve done any Visual Studio with ASP.NET, WPF, Win Forms and other stacks.

In the code-behind you can author classes & methods and invoke those in the U-SQL script.

Now when you submit your script, Visual Studio performs some magic on your behalf.  To see that magic, let’s look at an example:

@lines =
EXTRACT Line string
FROM "/Marvel/vert1.txt"
USING Extractors.Text(delimiter : '$');

@trans =
SELECT MyNamespace.MyClass.Hello(Line) AS Greeting
FROM @lines;

OUTPUT @trans
TO "bla"
USING Outputters.Csv();

This is a tad ceremonious, but you need to have an output for a script to be valid and it’s easier to take an input than create one from scratch.  Anyhow, the important part is the invocation of the Hello method.  Now here’s the code behind:

namespace MyNamespace
{
	public static class MyClass
	{
		public static string Hello(string s)
		{
			return "Hello " + s;
		}
	}
}

Now if you submit that script as a job, you can look at the generated script by clicking the “Script link” at the bottom left of the job tab:

image

You’ll see the script submitted to the ADLA engine:

// Generated Code Behind Header
CREATE ASSEMBLY [__codeBehind_gv215f0m.00i] FROM 0x4D5A900003000...;
REFERENCE ASSEMBLY [__codeBehind_gv215f0m.00i];

// Generated Code Behind Header
@lines =
EXTRACT Line string
FROM "/Marvel/vert1.txt"
USING Extractors.Text(delimiter : '$');

@trans =
SELECT MyNamespace.MyClass.Hello(Line) AS Greeting
FROM @lines;

OUTPUT @trans
TO "bla"
USING Outputters.Csv();
// Generated Code Behind Footer
USE DATABASE [master];
USE SCHEMA [dbo];

DROP ASSEMBLY [__codeBehind_gv215f0m.00i];
// Generated Code Behind Footer

You see that a few lines were added.  Basically, the script is augmented to register an assembly and to drop it (delete it) at the end of the script.

The assembly is registered by emitting its byte-code inline in hexadecimal.  A bit crude, but it seems to work.

Now this works well, but it has a few limitations:

  1. You can’t share code between scripts:  only the code-behind of a given script is emitted in the registered assembly.  So this solution isn’t good for sharing code across scripts.
  2. The assembly is available only for the duration of your script.  This is fine if you want to invoke some C# code in queries, for instance.  On the other hand, if you want to create, say, a USQL function using C# code and invoke that function in another script, that will fail.  The way the runtime works, your assembly would be required by the time the calling script gets executed, but since the script creating the function registers and then drops the assembly, that assembly wouldn’t be around anymore.

So if this solution works for your requirements:  use it.  It is by far the simplest available.

Visual Studio Register Assembly menu option

Create a library project, i.e. a Class Library (For U-SQL Application) template.

image

This allows you to create code independent of any script.  Right-click on the project and select the last option in the menu.

image

This will pop up a dialog with a few options.

image

Now be careful and always select the “Replace assembly if it already exists” option, otherwise you can only create it once.

Select which ADLA account and which DB you want the assembly to be registered in and submit the job.

Again, if you look at the script submitted to ADLA, it looks like this:

USE DATABASE [master];
DROP ASSEMBLY IF EXISTS [XYZ];
CREATE ASSEMBLY [XYZ] FROM 0x4D5A90000300000004000000FFFF0000…

So the assembly is registered on your behalf, independently of other scripts.  Again, this is done by emitting the assembly’s byte-code inline.

The major inconvenience of this method is that you need to register the assembly manually, as opposed to just recompiling.

Registering it manually

Now, let’s go hard core.  We’ve seen how Visual Studio does it, why can’t we do the same?

Well, not exactly the same unless you want to input the byte-code in hexadecimal.

If you look at the documentation, you can see there is another way to register an assembly:  by referring to the DLL in Azure storage:

USE DATABASE Marvel;
DROP ASSEMBLY IF EXISTS XYZ;
CREATE ASSEMBLY XYZ FROM "<my location>";
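For instance, assuming the compiled MarvelLib.dll has been uploaded to an Assemblies folder in the default Data Lake Store of the account (the folder and file names here are only illustrative), the registration script could look like this:

USE DATABASE Marvel;

DROP ASSEMBLY IF EXISTS MarvelLib;

//  The path below is illustrative:  it points to a DLL previously uploaded to the store
CREATE ASSEMBLY MarvelLib FROM "/Assemblies/MarvelLib.dll";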

Now the major drawbacks of this approach are:

  1. You have to do it manually, in the sense that it doesn’t happen automatically when you compile.
  2. You need to compile your library, upload the DLLs into storage and then submit the registering script.
  3. If you change the files in the storage, it doesn’t change the assembly used by the scripts.  You need to drop & re-create the assembly.

Conclusion / My Recommendations

I would say that at the moment, with the current tooling, there is no perfect solution.  So here are my recommendations for the solutions we explored, in their respective contexts.

  1. Inline C#
    • By far the simplest and best supported
    • Use if you can get by with inline code and do not need to share across scripts
  2. Code Behind
    • Use if you do not need to share across scripts
    • Use if your C# code is only called in your script and won’t be called by other scripts via a function or procedure you create in your script
  3. Visual Studio Register Assembly option
    • Use if you need to share across scripts
    • Use if you do not need to integrate into an automated build and do not mind the manual process
  4. Manual Registering
    • Use if you need to share across scripts
    • Use if you need to integrate into your continuous build system
    • Consider automating the process by having tasks copy the assembly to storage and submit the assembly registration script automatically as part of the build process

So those are my recommendations.  Let me know if you have any comments / questions!