Querying tables in Azure Data Lake Analytics

Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this or have a look at Azure Data Lake series for more posts on Azure Data Lake.

I thought I would kick some posts about more complex scenarios to display what’s possibile with that technology.

In my last post, we did import data from the Social characteristics of the Marvel Universe data set into ADLA tables.  In this post, I will query those tables to get some insights out of them.

Data Model

The data model in the ADLA tables is the following:

image

Popular Character

Obvious question:  which character is the most popular, i.e. which character has been published the most?

Let’s ask the question:

USE DATABASE Marvel;
 
//  Most popular characters
@characters =
    SELECT c.CharacterID,
           c.CharacterName,
           COUNT(cb.BookID) AS PublicationCount
    FROM Input.Character AS c
         INNER JOIN
             Input.CharacterBook AS cb
         ON cb.CharacterID == c.CharacterID
    GROUP BY c.CharacterID,
             c.CharacterName;
 
@charactersWithCount =
    SELECT *
    FROM @characters
    ORDER BY PublicationCount DESC
    FETCH FIRST 15 ROWS ONLY;
 
OUTPUT @charactersWithCount
TO "/Outputs/TopCharacters.tsv"
USING Outputters.Tsv();
 

We get the predictable result:

Popular DUOs

Another obvious question is:  which 2 characters are published together the most?

USE DATABASE Marvel;
 
//  Most popular duo of characters
@duoCount =
    SELECT COUNT(cb1.BookID) AS PublicationCount,
           cb1.CharacterID AS CharacterID1,
           cb2.CharacterID AS CharacterID2
    FROM Input.CharacterBook AS cb1
         INNER JOIN
             Input.CharacterBook AS cb2
         ON cb1.BookID == cb2.BookID
         // Eliminate duos consisting of the same character
    WHERE cb1.CharacterID != cb2.CharacterID
         // Making sure the same duo will be there only once
         AND cb1.CharacterID < cb2.CharacterID
    GROUP BY cb1.CharacterID,
             cb2.CharacterID;
 
@sortedDuos =
    SELECT dc.CharacterID1,
           c1.CharacterName AS CharacterName1,
           dc.CharacterID2,
           c2.CharacterName AS CharacterName2,
           dc.PublicationCount
    FROM @duoCount AS dc
         INNER JOIN
             Input.Character AS c1
         ON c1.CharacterID == dc.CharacterID1
         INNER JOIN
             Input.Character AS c2
         ON c2.CharacterID == dc.CharacterID2
    ORDER BY PublicationCount DESC
    FETCH FIRST 15 ROWS ONLY;
 
OUTPUT @sortedDuos
TO "/Outputs/TopDuos.tsv"
USING Outputters.Tsv();

Again, if you know the Marvel Universe, the results make sense:

Conclusion

We’ve seen how to do simple analytics using USQL on ADLA tables.

The data set I’m using is relatively small but with a Big Data set, the power of Hadoop opens lots of possibilities.

You can explore your data until you find some queries that make sense.  This means you can very easily explore big data set without provisionning servers or even VMs.


Leave a comment