Querying tables in Azure Data Lake Analytics
Solution ·Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.
You can get started by reading this or have a look at Azure Data Lake series for more posts on Azure Data Lake.
I thought I would kick some posts about more complex scenarios to display what’s possibile with that technology.
In my last post, we did import data from the Social characteristics of the Marvel Universe data set into ADLA tables. In this post, I will query those tables to get some insights out of them.
Data Model
The data model in the ADLA tables is the following:
Popular Character
Obvious question: which character is the most popular, i.e. which character has been published the most?
Let’s ask the question:
USE DATABASE Marvel;
// Most popular characters
@characters =
SELECT c.CharacterID,
c.CharacterName,
COUNT(cb.BookID) AS PublicationCount
FROM Input.Character AS c
INNER JOIN
Input.CharacterBook AS cb
ON cb.CharacterID == c.CharacterID
GROUP BY c.CharacterID,
c.CharacterName;
@charactersWithCount =
SELECT *
FROM @characters
ORDER BY PublicationCount DESC
FETCH FIRST 15 ROWS ONLY;
OUTPUT @charactersWithCount
TO "/Outputs/TopCharacters.tsv"
USING Outputters.Tsv();
We get the predictable result:
- SPIDER-MAN/PETER PARKER
- CAPTAIN AMERICA
- IRON MAN/TONY STARK
- THING/BENJAMIN J. GRIMM
- THOR/DR. DONALD BLAKE/SIGURD JARLSON II/JAKE OLSON/LOREN OLSON
- HUMAN TORCH/JOHNNY STORM
- MR. FANTASTIC/REED RICHARDS
- HULK/DR. ROBERT BRUCE BANNER
- WOLVERINE/LOGAN
- INVISIBLE WOMAN/SUE STORM RICHARDS
- BEAST/HENRY &HANK& P. MCCOY
- SCARLET WITCH/WANDA MAXIMOFF
- DR. STRANGE/STEPHEN STRANGE
- DAREDEVIL/MATT MURDOCK
- WATSON-PARKER, MARY JANE
Popular DUOs
Another obvious question is: which 2 characters are published together the most?
USE DATABASE Marvel;
// Most popular duo of characters
@duoCount =
SELECT COUNT(cb1.BookID) AS PublicationCount,
cb1.CharacterID AS CharacterID1,
cb2.CharacterID AS CharacterID2
FROM Input.CharacterBook AS cb1
INNER JOIN
Input.CharacterBook AS cb2
ON cb1.BookID == cb2.BookID
// Eliminate duos consisting of the same character
WHERE cb1.CharacterID != cb2.CharacterID
// Making sure the same duo will be there only once
AND cb1.CharacterID < cb2.CharacterID
GROUP BY cb1.CharacterID,
cb2.CharacterID;
@sortedDuos =
SELECT dc.CharacterID1,
c1.CharacterName AS CharacterName1,
dc.CharacterID2,
c2.CharacterName AS CharacterName2,
dc.PublicationCount
FROM @duoCount AS dc
INNER JOIN
Input.Character AS c1
ON c1.CharacterID == dc.CharacterID1
INNER JOIN
Input.Character AS c2
ON c2.CharacterID == dc.CharacterID2
ORDER BY PublicationCount DESC
FETCH FIRST 15 ROWS ONLY;
OUTPUT @sortedDuos
TO "/Outputs/TopDuos.tsv"
USING Outputters.Tsv();
Again, if you know the Marvel Universe, the results make sense:
- HUMAN TORCH/JOHNNY STORM & THING/BENJAMIN J. GRIMM
- HUMAN TORCH/JOHNNY STORM & MR. FANTASTIC/REED RICHARDS
- MR. FANTASTIC/REED RICHARDS & THING/BENJAMIN J. GRIMM
- INVISIBLE WOMAN/SUE STORM RICHARDS & MR. FANTASTIC/REED RICHARDS
- HUMAN TORCH/JOHNNY STORM & INVISIBLE WOMAN/SUE STORM RICHARDS
- INVISIBLE WOMAN/SUE STORM RICHARDS & THING/BENJAMIN J. GRIMM
- SPIDER-MAN/PETER PARKER & WATSON-PARKER, MARY JANE
- JAMESON, J. JONAH & SPIDER-MAN/PETER PARKER
- CAPTAIN AMERICA & IRON MAN/TONY STARK
- SCARLET WITCH/WANDA MAXIMOFF & VISION
- ANT-MAN/DR. HENRY J. PYM & WASP/JANET VAN DYNE PYM
- CYCLOPS/SCOTT SUMMERS & MARVEL GIRL/JEAN GREY SUMMERS
- STORM/ORORO MUNROE Subscribe! & WOLVERINE/LOGAN
- CAPTAIN AMERICA & THOR/DR. DONALD BLAKE/SIGURD JARLSON II/JAKE OLSON/LOREN OLSON
- CAPTAIN AMERICA & VISION
Conclusion
We’ve seen how to do simple analytics using USQL on ADLA tables.
The data set I’m using is relatively small but with a Big Data set, the power of Hadoop opens lots of possibilities.
You can explore your data until you find some queries that make sense. This means you can very easily explore big data set without provisionning servers or even VMs.