Querying tables in Azure Data Lake Analytics


Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this or have a look at Azure Data Lake series for more posts on Azure Data Lake.

I thought I would kick some posts about more complex scenarios to display what’s possibile with that technology.

In my last post, we did import data from the Social characteristics of the Marvel Universe data set into ADLA tables.  In this post, I will query those tables to get some insights out of them.

Data Model

The data model in the ADLA tables is the following:

image

Popular Character

Obvious question:  which character is the most popular, i.e. which character has been published the most?

Let’s ask the question:

USE DATABASE Marvel;
 
//  Most popular characters
@characters =
    SELECT c.CharacterID,
           c.CharacterName,
           COUNT(cb.BookID) AS PublicationCount
    FROM Input.Character AS c
         INNER JOIN
             Input.CharacterBook AS cb
         ON cb.CharacterID == c.CharacterID
    GROUP BY c.CharacterID,
             c.CharacterName;
 
@charactersWithCount =
    SELECT *
    FROM @characters
    ORDER BY PublicationCount DESC
    FETCH FIRST 15 ROWS ONLY;
 
OUTPUT @charactersWithCount
TO "/Outputs/TopCharacters.tsv"
USING Outputters.Tsv();
 

We get the predictable result:

  • SPIDER-MAN/PETER PARKER
  • CAPTAIN AMERICA
  • IRON MAN/TONY STARK
  • THING/BENJAMIN J. GRIMM
  • THOR/DR. DONALD BLAKE/SIGURD JARLSON II/JAKE OLSON/LOREN OLSON
  • HUMAN TORCH/JOHNNY STORM
  • MR. FANTASTIC/REED RICHARDS
  • HULK/DR. ROBERT BRUCE BANNER
  • WOLVERINE/LOGAN
  • INVISIBLE WOMAN/SUE STORM RICHARDS
  • BEAST/HENRY &HANK& P. MCCOY
  • SCARLET WITCH/WANDA MAXIMOFF
  • DR. STRANGE/STEPHEN STRANGE
  • DAREDEVIL/MATT MURDOCK
  • WATSON-PARKER, MARY JANE

Popular DUOs

Another obvious question is:  which 2 characters are published together the most?

USE DATABASE Marvel;

//  Most popular duo of characters
@duoCount =
    SELECT COUNT(cb1.BookID) AS PublicationCount,
           cb1.CharacterID AS CharacterID1,
           cb2.CharacterID AS CharacterID2
    FROM Input.CharacterBook AS cb1
         INNER JOIN
             Input.CharacterBook AS cb2
         ON cb1.BookID == cb2.BookID
         // Eliminate duos consisting of the same character
    WHERE cb1.CharacterID != cb2.CharacterID
         // Making sure the same duo will be there only once
         AND cb1.CharacterID < cb2.CharacterID
    GROUP BY cb1.CharacterID,
             cb2.CharacterID;

@sortedDuos =
    SELECT dc.CharacterID1,
           c1.CharacterName AS CharacterName1,
           dc.CharacterID2,
           c2.CharacterName AS CharacterName2,
           dc.PublicationCount
    FROM @duoCount AS dc
         INNER JOIN
             Input.Character AS c1
         ON c1.CharacterID == dc.CharacterID1
         INNER JOIN
             Input.Character AS c2
         ON c2.CharacterID == dc.CharacterID2
    ORDER BY PublicationCount DESC
    FETCH FIRST 15 ROWS ONLY;

OUTPUT @sortedDuos
TO "/Outputs/TopDuos.tsv"
USING Outputters.Tsv();

[/code]

Again, if you know the Marvel Universe, the results make sense:

  • HUMAN TORCH/JOHNNY STORM & THING/BENJAMIN J. GRIMM
  • HUMAN TORCH/JOHNNY STORM & MR. FANTASTIC/REED RICHARDS
  • MR. FANTASTIC/REED RICHARDS & THING/BENJAMIN J. GRIMM
  • INVISIBLE WOMAN/SUE STORM RICHARDS & MR. FANTASTIC/REED RICHARDS
  • HUMAN TORCH/JOHNNY STORM & INVISIBLE WOMAN/SUE STORM RICHARDS
  • INVISIBLE WOMAN/SUE STORM RICHARDS & THING/BENJAMIN J. GRIMM
  • SPIDER-MAN/PETER PARKER & WATSON-PARKER, MARY JANE
  • JAMESON, J. JONAH & SPIDER-MAN/PETER PARKER
  • CAPTAIN AMERICA & IRON MAN/TONY STARK
  • SCARLET WITCH/WANDA MAXIMOFF & VISION
  • ANT-MAN/DR. HENRY J. PYM & WASP/JANET VAN DYNE PYM
  • CYCLOPS/SCOTT SUMMERS & MARVEL GIRL/JEAN GREY SUMMERS
  • STORM/ORORO MUNROE Subscribe! & WOLVERINE/LOGAN
  • CAPTAIN AMERICA & THOR/DR. DONALD BLAKE/SIGURD JARLSON II/JAKE OLSON/LOREN OLSON
  • CAPTAIN AMERICA & VISION

Conclusion

We’ve seen how to do simple analytics using USQL on ADLA tables.

The data set I’m using is relatively small but with a Big Data set, the power of Hadoop opens lots of possibilities.

You can explore your data until you find some queries that make sense.  This means you can very easily explore big data set without provisionning servers or even VMs.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s