Registering assemblies in Azure Data Lake Analytics


UPDATE (19-01-2016):  Have a look at the Azure Data Lake series for more posts on Azure Data Lake.

Azure Data Lake (both Storage & Analytics) has been in public preview for a month or two.

You can get started by reading this.

I thought I would kick off some posts about more complex scenarios to show what’s possible with that technology.

In this post I’ll write about how to register assemblies in Azure Data Lake Analytics (ADLA).

This one took me quite a while to figure out, no thanks to the beta state of the tooling.

The problem

Let’s start with the problem.  Let’s say we need to write some custom C# code and share it among multiple U-SQL scripts.

I’m talking about “complex code”, not inline C# code you insert within a U-SQL script.  The following is inline C#:

SELECT
Line.Split('|')[0]
FROM @lines

Here we simply call the string.Split method inline in a SELECT statement within a U-SQL script.  This is “complex code” called in U-SQL:

SELECT
MyNamespace.MyClass.MyMethod(Line)
FROM @lines

where, of course, MyNamespace.MyClass.MyMethod is defined somewhere.

Inline code works perfectly well and is well supported.  For complex code, you need to register assemblies and this is where the fun begins.

Now you’ll often need to go with complex code because inline code is a bit limited.  You can’t instantiate objects and hold references to them in U-SQL right now.  So inline code really is just that:  an inline method call.

I’ll show you different approaches available and tell you about their shortcomings.

Keep in mind that this is based on the public preview and that I am writing these lines in early January 2016.  Very likely, many if not all of those shortcomings will disappear in future releases.

Code behind

The easiest way to write complex code is to use the code-behind of a script.

This should look familiar to you if you’ve done any Visual Studio with ASP.NET, WPF, Win Forms and other stacks.

In the code-behind you can author classes & methods and invoke those in the U-SQL script.

Now when you submit your script, Visual Studio performs some magic on your behalf.  To see that magic, let’s look at an example:

@lines =
EXTRACT Line string
FROM "/Marvel/vert1.txt"
USING Extractors.Text(delimiter : '$');

@trans =
SELECT MyNamespace.MyClass.Hello(Line)
FROM @lines;

OUTPUT @trans
TO "bla"
USING Outputters.Csv();

This is a tad ceremonious, but you need to have an output for a script to be valid and it’s easier to take an input than create one from scratch.  Anyhow, the important part is the invocation of the Hello method.  Now here’s the code behind:

namespace MyNamespace
{
	public static class MyClass
	{
		public static string Hello(string s)
		{
			return "Hello " + s;
		}
	}
}

Now if you submit that script as a job and click the “Script link” at the bottom left of the job tab, you’ll see the script submitted to the ADLA engine:

// Generated Code Behind Header
CREATE ASSEMBLY [__codeBehind_gv215f0m.00i] FROM 0x4D5A900003000...;
REFERENCE ASSEMBLY [__codeBehind_gv215f0m.00i];

// Generated Code Behind Header
@lines =
EXTRACT Line string
FROM "/Marvel/vert1.txt"
USING Extractors.Text(delimiter : '$');

@trans =
SELECT MyNamespace.MyClass.Hello(Line)
FROM @lines;

OUTPUT @trans
TO "bla"
USING Outputters.Csv();
// Generated Code Behind Footer
USE DATABASE [master];
USE SCHEMA [dbo];

DROP ASSEMBLY [__codeBehind_gv215f0m.00i];
// Generated Code Behind Footer

You see that a few lines were added.  Basically, the script is augmented to register an assembly and to drop it (delete it) at the end of the script.

The assembly is registered by emitting its byte-code inline in hexadecimal.  A bit crude, but it seems to work.

Now this works well but it has a few limitations:

  1. You can’t share code between scripts:  only the code-behind of a given script is emitted in the registered assembly.  So this solution isn’t good for sharing code across scripts.
  2. The assembly is available only for the duration of your script.  This is fine if you want to invoke some C# code in queries, for instance.  On the other hand, if you want to create, say, a U-SQL function using C# code and invoke that function in another script, that will fail.  The way the runtime works, your assembly is required at the time the calling script gets executed.  But since the script creating the function registers and then drops the assembly, that assembly isn’t around later.

So if this solution works for your requirements:  use it.  It is by far the simplest available.

Visual Studio Register Assembly menu option

Create a library project using the Class Library (For U-SQL Application) template.

This allows you to create code independent of scripts.  Right click on the project and select the last option on the menu (the Register Assembly option).

This will pop up a dialog with a few options.

Now be careful and always check the “Replace assembly if it already exists” option, otherwise the registration only works the first time.

Select which ADLA account and which DB you want the assembly to be registered in and submit the job.

Again, if you look at the script submitted to ADLA, it looks like this:

USE DATABASE [master];
DROP ASSEMBLY IF EXISTS [XYZ];
CREATE ASSEMBLY [XYZ] FROM 0x4D5A90000300000004000000FFFF0000…

So the assembly is registered on your behalf, independently of other scripts.  Again, this is done by emitting the assembly’s byte-code inline.

The major inconvenience with this method is that you need to register the assembly manually every time, as opposed to just recompiling.
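To use the registered assembly, a script only has to reference it by name.  The following is a minimal sketch rather than output from the tooling.  It assumes the library registered as XYZ in master contains the MyNamespace.MyClass.Hello method from the code-behind example; the Greeting column alias is made up for illustration.

```usql
// Assumption: XYZ was registered in [master] and contains
// MyNamespace.MyClass.Hello (the method from the code-behind example).
USE DATABASE [master];
REFERENCE ASSEMBLY [XYZ];

@lines =
    EXTRACT Line string
    FROM "/Marvel/vert1.txt"
    USING Extractors.Text(delimiter : '$');

// "Greeting" is a hypothetical column name for the computed value.
@trans =
    SELECT MyNamespace.MyClass.Hello(Line) AS Greeting
    FROM @lines;

OUTPUT @trans
TO "bla"
USING Outputters.Csv();
```

Unlike the code-behind version, nothing drops the assembly at the end of the script, so any other script can reference it the same way.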

Registering it manually

Now, let’s go hard core.  We’ve seen how Visual Studio does it, why can’t we do the same?

Well, not exactly the same unless you want to input the byte-code in hexadecimal.

If you look at the documentation, you can see there is another way to register an assembly:  by referencing the DLL in the Azure storage:

USE DATABASE Marvel;
DROP ASSEMBLY IF EXISTS XYZ;
CREATE ASSEMBLY XYZ FROM "<my location>";

Now the major drawbacks of this approach are:

  1. You have to do it manually, in the sense that it doesn’t happen automatically when you compile.
  2. You need to compile your library, upload the DLLs into the storage and then submit the registering script.
  3. If you change the files in the storage, it doesn’t change the assembly used by the scripts.  You need to drop & re-create the assembly.
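Once the assembly is registered persistently, the second limitation of code-behind goes away:  a function created in one script can call the C# code and still work when it is invoked later from another script.  Here is a rough sketch of what that could look like, not a tested recipe.  It assumes the XYZ assembly registered in the Marvel database contains MyNamespace.MyClass.Hello; the function name GreetLines, the Greeting column and the output path are hypothetical.  It also assumes the function body needs its own REFERENCE ASSEMBLY statement, since the body does not inherit references from the calling script.

```usql
// Script 1: create the function once (hypothetical name: GreetLines).
USE DATABASE Marvel;

DROP FUNCTION IF EXISTS dbo.GreetLines;

CREATE FUNCTION dbo.GreetLines()
RETURNS @greetings TABLE(Greeting string)
AS
BEGIN
    // Assumption: the body references the assembly itself,
    // using a two-part name to be explicit about the database.
    REFERENCE ASSEMBLY Marvel.XYZ;

    @lines =
        EXTRACT Line string
        FROM "/Marvel/vert1.txt"
        USING Extractors.Text(delimiter : '$');

    @greetings =
        SELECT MyNamespace.MyClass.Hello(Line) AS Greeting
        FROM @lines;

    RETURN;
END;

// Script 2: submitted later as a separate job, invoking the function.
// @result =
//     SELECT * FROM Marvel.dbo.GreetLines() AS g;
//
// OUTPUT @result
// TO "/Marvel/greetings.csv"
// USING Outputters.Csv();
```

The second script is left commented out because it is meant to be submitted on its own, after the first job has completed.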

Conclusion / My Recommendations

I would say that at the moment, with the current tooling, there is no perfect solution.  So I would recommend each of the solutions we explored for a given context.

  1. Inline C#
    • By far the simplest and best supported
    • Use if you can get by with inline code and do not need to share across scripts
  2. Code Behind
    • Use if you do not need to share across scripts
    • Use if your C# code is only called in your script and won’t be called by other scripts via a function or procedure you create in your script
  3. Visual Studio Register Assembly option
    • Use if you need to share across scripts
    • Use if you do not need to integrate into an automated build and do not mind the manual process
  4. Manual Registering
    • Use if you need to share across scripts
    • Use if you need to integrate into your continuous build system
    • Consider automating the process by having build tasks copy the assembly to the storage and submit the assembly registering script as part of the build process

So those are my recommendations.  Let me know if you have any comments / questions!
