Free ebook: Azure Machine Learning

You’re into Machine Learning, got into Azure ML, looked at my couple of blog posts about it and want to take it to the next level?

Microsoft released an eBook for that exact purpose:

Free ebook: Azure Machine Learning (Microsoft Azure Essentials)

That book is targeted at people who want to get better with the Azure ML tool:  developers, data scientists, data analysts, etc.  It covers a bit of the conceptual side (Machine Learning in general) and quickly dives into practical examples with step-by-step instructions and screenshots.

The book is available in ePub (ideal for commuting!), PDF & Mobi.

Here are the chapters:

  • Chapter 1, “Introduction to the science of data,” shows how Azure Machine Learning represents a critical step forward in democratizing data science by making available a fully-managed cloud service for building predictive analytics solutions.
  • Chapter 2, “Getting started with Azure Machine Learning,” covers the basic concepts behind the science and methodology of predictive analytics.
  • Chapter 3, “Using Azure ML Studio,” explores the basic fundamentals of Azure Machine Learning Studio and helps you get started on your path towards data science greatness.
  • Chapter 4, “Creating Azure ML client and server applications,” expands on a working Azure Machine Learning predictive model and explores the types of client and server applications that you can create to consume Azure Machine Learning web services.
  • Chapter 5, “Regression analytics,” takes a deeper look at some of the more advanced machine learning algorithms that are exposed in Azure ML Studio.
  • Chapter 6, “Cluster analytics,” explores scenarios where the machine conducts its own analysis on the dataset, determines relationships, infers logical groupings, and generally attempts to make sense of chaos by literally determining the forests from the trees.
  • Chapter 7, “The Azure ML Matchbox recommender,” explains one of the most powerful and pervasive implementations of predictive analytics in use on the web today and how it is crucial to success in many consumer industries.
  • Chapter 8, “Retraining Azure ML models,” explores the mechanisms for incorporating “continuous learning” into the workflow for our predictive models.

Enjoy!

 

Azure ML – Overfitting with Neural Networks

In a past post, I discussed the concept of overfitting in Machine Learning.  I also alluded to it in my post about Polynomial Regression.

Basically, overfitting occurs when your model performs well on training data and poorly on data it hasn’t seen.

Here I’ll give an example using Artificial Neural Networks.  Those can be quite prone to overfitting since they have a variable number of parameters, i.e. a different number of hidden nodes.  Overfitting will always occur once you put too many parameters in a model.

Data

I’ll reuse the height-weight data set I had you create in a past post.  If you need to recreate it, go back to that post.

Then let’s create a new experiment, let’s title it “Height-Weight – Overfitting” and let’s drop the data set on it.

We do not want the Index column and I would like to rename the columns.  Let’s use our new friend the Apply SQL Transformation module with the following SQL expression:

SELECT
"Height(Inches)" AS Height,
"Weight(Pounds)" AS Weight
FROM t1

In one hit, I renamed the fields and removed another one.

We will then drop a Split module and connect it to the data set.  We will configure the Split module as follows:

image

The rest can stay as is.

Basically, our data set has 200 records and we’re going to take only the first 5 (0.025, or 2.5%) to train a neural network and the next 195 to test.

Here I turned off the “Randomized split” option so you can obtain results comparable to mine.  Usually you leave that on.

As you can see, I’m really setting up a gross overfitting scenario where I starve my learning algorithm, showing it only a little data.

It might seem aggressive to take only 2.5% of the data for training, and it is.  Usually you would take 60% and above.  It’s just that I want to demonstrate overfitting with only 2 dimensions.  It usually occurs at higher dimensions, so in order to simulate it at a low dimension, I go a bit more aggressive with the training set size.

The experiment should look like this so far

image

Learning

Let’s drop Neural Network Regression, Train Model, Score Model & Evaluate Model modules on the surface and connect them like this:

image

In the Train Model module we select the weight column.  This is the column we want to predict.

In the Neural Network Regression module, we set the number of hidden nodes to 2.

image

To really drive the point home and increase overfitting, let’s crank the number of learning iterations to 10 000.  This will be useful when we increase the number of parameters.

image

And let’s leave the rest as is.

Let’s run the experiment.  This will train a neural network with 2 hidden nodes on 5 records and evaluate it against those same 5 records.

We can look at the result of Evaluate Model module:

image

The metrics are defined here.  We are going to look at the Relative Squared Error.
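For reference, the Relative Squared Error compares the model’s squared errors against those of a trivial baseline that always predicts the mean of the observed values.  A minimal sketch of the computation (my own illustration, not Azure ML’s internal code):

using System;
using System.Linq;

static double RelativeSquaredError(double[] actual, double[] predicted)
{
    var mean = actual.Average();
    var modelError = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
    var baselineError = actual.Sum(a => (a - mean) * (a - mean));

    // A value below 1 means the model beats simply predicting the mean
    return modelError / baselineError;
}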

The numbers you get are likely different since Neural Networks are optimized using approximation methods and randomization.  So each run might yield different results.

Testing

Now how does the model perform on data it didn’t see during training?

Let’s drop another pair of Score Model & Evaluate Model modules and connect them like this:

image

Basically we will compute the score, or prediction, using the same trained model but on different data:  the 195 remaining records not used during training.

We run the experiment again and we get the following results on the test evaluation:

image

The error on the test data is higher than on the training data.  It nearly always is.  Let’s see how it evolves when we increase the number of hidden nodes.

Comparing with different number of nodes

We are going to compare how those metrics evolve when we change the number of parameters of the Neural Network model, i.e. the number of hidden nodes.

I do not know of a way to “loop” in AzureML.  It would be very nice if I could wrap the experiment in a loop and have the number of hidden nodes of the model vary within the loop.  If you know how to do that, please leave a comment!

Failing that, we are going to manually change the number of hidden nodes in the Neural Network Regression module.

In order to make our life easier, let’s make the reporting of the results more straightforward than having to open the results of the two Evaluate Model modules separately.  Let’s drop another Apply SQL Transformation module and connect it this way:

image

and type the following SQL expression in:

SELECT
t1."Relative Squared Error" AS TrainingRSE,
t2."Relative Squared Error" AS TestingRSE
FROM t1, t2

We are basically taking both outputs of the Evaluate Model modules and renaming them, which gives us (after another run) this nice result:

image

Neat, eh?

Ok, now let’s grind and manually change the number of hidden nodes a few times to fill the following table:

# Nodes | Training RSE | Testing RSE | Ratio
2       | 0.154358     | 3.270625    | 4.7%
3       | 0.154394     | 3.366281    | 4.6%
4       | 0.154442     | 3.673096    | 4.2%
5       | 0.154488     | 3.455582    | 4.5%
7       | 0.154612     | 3.835834    | 4.0%
10      | 0.154847     | 4.242703    | 3.6%
15      | 0.155334     | 4.301146    | 3.6%
20      | 0.155946     | 4.281125    | 3.6%
30      | 0.157558     | 4.222742    | 3.7%
50      | 0.162018     | 3.586147    | 4.5%
75      | 0.006491     | 3.920912    | 0.2%
100     | 0.006791     | 3.082774    | 0.2%
150     | 0.000025     | 2.544964    | 0.0%
200     | 0.000015     | 2.249117    | 0.0%

We can see that as we increased the number of parameters, the training error got lower and the testing error got higher.  At some point the testing error starts going down again, but the ratio between the two always goes down.

A more visual way to look at it is to look at the actual predictions done by the models.

image

The green dots are the 200 points, the 5 yellow dots are the training set while the blue dots are the prediction for a Neural Network with 200 hidden nodes.

We can tell the prediction curve goes perfectly through the training points but poorly describes the entire set.

Summary

I wanted to show what overfitting could look like.

Please note that I did exaggerate a lot of elements:  the huge number of training iterations, the huge number of hidden nodes.

The main point I want you to remember is to always test your model and not to automatically select the maximum number of parameters available!

Basically, overfitting is like learning by heart.  The learning algorithm learns the training set perfectly and then generalizes poorly.

Docker Containers on Windows Server

If you had any doubts about the increased pace of IT innovation, look at Docker Containers.  The project was open sourced in March 2013 as a container technology for Linux and 1.5 years later, in October 2014, Microsoft announced they were integrating that technology into Windows Server 2016!

That’s 1.5 years from toe in the water to major influence.  Impressive!


The first Windows Server Container preview was announced in August 2015 as part of Technical Preview 3 of Windows Server.  The preview also comes with Visual Studio integration, in the form of Visual Studio Tools for Docker.

Mark Russinovich also published a very good technical post about Docker containers on Windows:  what they are, what their advantages are & the scenarios they nicely apply to.

Basically, Docker Containers are standard packages to deploy a solution on a host.  The main advantage of Docker Containers is the small footprint of the container, which results in a higher density of applications on a given host and a very quick startup time, compared to a Virtual Machine where the entire OS must be loaded in memory and booted.

In Windows, hosts will come in two flavours:  Windows Server hosts & Hyper-V hosts.  The former maximizes resource utilization and container density on a host while the latter maximizes isolation.

At first the Hyper-V container sounds like it defeats the purpose of having Docker Containers in the first place, since it basically implements the container as an entire VM.  But if you think about it, in the long run it makes perfect sense.  The first versions of Docker Containers on Windows will likely have security holes in them.  Therefore if you have scenarios with ‘hostile multi-tenants’, you’ll probably want to stick to Hyper-V containers.  But in time, the security of Docker on Windows will tighten and you’ll be able to move to normal containers as a configuration change.

Service Fabric

We can imagine that once Windows Server 2016 rolls out, we’ll see Docker Containers appearing in Azure.  I wouldn’t be surprised to see them fuse with App Services shortly after that.

They are also very likely to be part of the upcoming Azure Service Fabric, Microsoft’s offering for rapidly building Micro Services.

Integration with Azure Service Bus


I’ve been consulting for 1.5 years for a customer embarking on a journey leveraging Microsoft Azure as an Enterprise platform, helping them rethink their application portfolio.

Characteristics of that customer:

  • Lots of Software as a Service (SaaS) third parties
  • Business is extremely dynamic, in terms of requirements, transitions, partnerships, restructuring, etc.
  • Medium operational budget:  they needed to get it pretty much right the first time
  • Low transaction volume

One of the first things we did was to think about the way different systems would integrate together given the different constraints of the organization’s IT landscape.

We settled on using Azure Service Bus to do a lot of the integrations.  Since then, I’ve worked to help them actually implement that in their applications, all the way to the details of operationalization.

Here I wanted to give my lessons learned on what worked well and what didn’t.  Hopefully this will prove useful to others out there setting out on a similar integration program.

Topics vs Queues

The first thing we decided was to use Topics & Subscriptions as opposed to Queues.  Event Hubs didn’t exist when we started, so they weren’t considered.

They work in similar ways with one key difference:  a topic can have many subscribers.

This ended up being a really good decision.  It costs nearly nothing:  configuring a subscription takes seconds longer than just configuring a queue.  But it bought us the flexibility to add subscribers along the way as we evolved without disrupting existing integrations.

A big plus.
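For illustration, provisioning a topic with a couple of subscriptions with the .NET Service Bus SDK roughly looks like this (the connection string and entity names are made up for the example):

using Microsoft.ServiceBus;

var connectionString = "<your Service Bus connection string>";
var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

// One topic, many subscribers:  each consumer gets its own subscription
if (!namespaceManager.TopicExists("orders"))
{
    namespaceManager.CreateTopic("orders");
}

namespaceManager.CreateSubscription("orders", "billing");
namespaceManager.CreateSubscription("orders", "shipping");

// Adding another consumer later is just one more CreateSubscription call,
// without touching the publisher or the existing subscriptions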

Meta Data

In order to implement a meaningful publish / subscribe mechanism, you need a way to filter messages.  In Azure Service Bus, subscriptions filter topic messages on metadata, for instance:

  • Content Type
  • Label
  • To
  • Custom Properties

If you want your integration architecture to have long-term value and what you build today be forward compatible, i.e. you want to avoid rework when implementing new solutions, you need to make it possible for future consumers to filter today’s messages.

It’s hard to know what future consumers will need but you can try populating the obvious.  Also, make sure your consumers don’t mind if new meta data is added along the way.

For instance, you want to be able to publish new types of messages.  A topic might start with having orders published on it, but with time you might want to publish price-correction messages.  If a subscription just takes everything from the topic, it will swallow the price corrections and potentially blow up the consumer.

One thing we standardized was the use of content-type.  The content type would tell what type of message the content is about.  It would actually contain the major version of the message.  This way an old consumer wouldn’t break when we changed a message version.

We used labels to identify the system publishing a message.  This was often useful to stop a publishing loop:  if a system subscribes to a topic where it itself publishes, you don’t want it to consume its own messages and potentially re-publish information.  This field allowed us to filter out such messages.

Custom Properties were more business specific and the hardest to guess in advance.  They should probably contain the main attributes of the message itself.  For an order message, the product ID, product category ID, etc. should probably be in there.
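To give an idea of what that looks like on the publishing side with the .NET SDK, here is a rough sketch (the property names and values are only examples; connectionString is the same connection string as above):

using Microsoft.ServiceBus.Messaging;

var orderJson = "{ \"orderId\": 1234, \"productId\": 567 }";    // example payload
var message = new BrokeredMessage(orderJson)
{
    ContentType = "order-v1",        // message type, including its major version
    Label = "order-entry-system"     // which system published the message
};

// Business attributes future consumers are likely to filter on
message.Properties["ProductId"] = 567;
message.Properties["ProductCategoryId"] = 42;

var topicClient = TopicClient.CreateFromConnectionString(connectionString, "orders");
topicClient.Send(message);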

Filtering subscriptions

Always filter subscriptions!  This is the only way to ensure future compatibility.  Make sure you specify what you want to consume.

Also, and I only noticed this too late while going into production:  filtering gives you a massive efficiency boost under load.

One of the biggest integrations we developed did a lot of filtering on the consumer side, i.e. the consumer C# code reading messages would discard messages based on criteria that could have been implemented in the filters.  That caused the subscriptions to catch way more messages than they should and take way more time to process.

Filtering is cheap on Azure Service Bus.  It takes minutes more to configure but accelerates your solution.  Use it!
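As a sketch, creating such a filtered subscription with the .NET SDK could look like this (the filter expression and names are illustrative):

using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

// Only version-1 order messages reach this subscription;  price-correction
// messages published later on the same topic simply never get delivered to it
namespaceManager.CreateSubscription(
    "orders",
    "order-processing",
    new SqlFilter("sys.ContentType = 'order-v1'"));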

Message Content

You’d better standardize on the format of the messages you’re going to carry around.  Is it XML, JSON, .NET binary serialization?

Again you want your systems to be decoupled so having a standard message format is a must.

Automatic Routing

There is a nice feature in Azure Service Bus:  Forward To.  This is a property of a subscription where you specify which topic (or queue) you want every message getting into the subscription to be routed to.

Why would you do that?

Somebody had a very clever idea that turned out to pay lots of dividends down the road.  You see, you may want to replay messages when they fail and eventually fall into the dead letter queue.  The problem with a publish / subscribe model is that when you replay a message, you replay it on the topic and all subscriptions get it.  Now if you have a topic with, say, 5 subscriptions and only one subscription struggles with a message and you replay it (after, for instance, changing the code of the corresponding consumer), then the 4 subscriptions that previously processed the message successfully will receive it again.

So the clever idea was to forward messages from every subscription to other topics where they could be replayed.  Basically we had two ‘types’ of topics:  topics to publish messages to and topics to consume messages from.
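As a rough sketch with the .NET SDK, where ‘orders’ is a publishing topic and ‘orders-shipping’ is the consuming topic of one subscriber (all names invented for the example):

using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

// Consuming topic:  messages can be replayed here without affecting any other consumer
namespaceManager.CreateTopic("orders-shipping");
namespaceManager.CreateSubscription("orders-shipping", "worker");

// Subscription on the publishing topic that does nothing but route matching
// messages to the consuming topic
var routingSubscription = new SubscriptionDescription("orders", "to-shipping")
{
    ForwardTo = "orders-shipping"
};

namespaceManager.CreateSubscription(routingSubscription, new SqlFilter("sys.ContentType = 'order-v1'"));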

Semantics of Topics

While you are at it, you probably want to define what your topics represent.

Why not put all messages under one topic?  Well, performance for one thing but probably management at some point.  At the other end of the spectrum, why not one topic per message type?

Order.

Service Bus guarantees order within the same topic, i.e. messages will be presented in the order they were delivered.  That is because you can choose to consume your messages (on your subscription) one by one.  But if messages are in different topics, you’ll consume them in different subscriptions and the order can be altered.

If order is important for some messages, regroup them under a same topic.

We ended up segmenting topics along enterprise data domains and it worked fine.  It really depends on what type of data transits on your bus.

Multiplexing on Sessions

A problem we faced early on was actually due to caring a bit too much about order.

We consumed one message at a time.  That could have caused performance issues, but the volume wasn’t big, so that didn’t hit us.

The problems start when you encounter a poison message though.  What do you do with it?  If you let it reach the dead letter queue, then you’ll process the next message and violate order.  So we put a huge retry count on so this would never happen.

But then that meant blocking the entire subscription until somebody got tired and looked into it.

A suggestion came from the Microsoft Azure Service Bus product team itself.  You can assign a session ID to a message.  Messages with the same session ID are grouped together and ordered properly, while messages from different sessions can be processed independently.  Your subscription needs to be session-ful for this to work.

This allowed just one of the sessions to fail while the other messages kept being processed.

Now how do you choose your session-ID?  You need to group messages that depend (order-wise) on each other together.  That typically boils down to the identifier of an entity in the message.

This can also speed up message processing since you are no longer bound to one-by-one processing.

After that, failing messages will keep failing, but that will only hold up correlated messages.  That is a nice “degraded service level” as opposed to completely failing.
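Here is a minimal sketch with the .NET SDK, assuming the subscription was created with RequiresSession set to true and that the entity identifier (say the order number) is used as the session ID; topicClient and connectionString are the same as in the earlier snippets:

using System;
using Microsoft.ServiceBus.Messaging;

// Publisher side:  correlated messages share a session ID
var message = new BrokeredMessage("{ \"orderId\": 1234 }")
{
    ContentType = "order-v1",
    SessionId = "1234"               // e.g. the order number
};
topicClient.Send(message);

// Consumer side:  a session is processed in order, but a poison message
// only holds up its own session, not the whole subscription
var subscriptionClient = SubscriptionClient.CreateFromConnectionString(
    connectionString, "orders-shipping", "worker");
var session = subscriptionClient.AcceptMessageSession(TimeSpan.FromSeconds(30));

BrokeredMessage received;

while ((received = session.Receive(TimeSpan.FromSeconds(5))) != null)
{
    ProcessOrder(received);          // hypothetical processing method
    received.Complete();
}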

Verbose Message Content

One of the things we changed midway was the message content we passed.  At first we used the bus to really send data, not only events.

There are advantages in doing so:  you really are decoupled, since the consumer gets the data with the message and the publishing system doesn’t even need to be up when the consumer processes the message.

It has one main disadvantage when you use the bus to synchronize or duplicate data though:  the bus becomes this train of data and any time you disrupt the train (e.g. failing message, replaying a message, etc.) you run the risk of breaking things.  For instance, if you switch two updates, you’ll end up having old data updated in your target system.  It sounds far-fetched but in operation it happens all the time.

Our solution was to simply send identifiers with the message.  The consumer would interrogate the source system to get the real data.  This way the data it got would always be up to date.
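A minimal sketch of this identifier-only flavour, reusing the clients from the earlier snippets (the source system’s endpoint is hypothetical):

using System.Net.Http;
using Microsoft.ServiceBus.Messaging;

// Publisher:  the body is only the identifier, not the data itself
topicClient.Send(new BrokeredMessage("1234") { ContentType = "order-updated-v1" });

// Consumer (inside an async message handler):  fetch the up-to-date data from the source system
var orderId = receivedMessage.GetBody<string>();
var httpClient = new HttpClient();
var order = await httpClient.GetStringAsync("https://source-system/api/orders/" + orderId);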

I wouldn’t recommend using that approach all the time since you lose a lot of benefits from the publish / subscribe mechanism.  For instance, if your message represents an action you want another system to perform (e.g. process order), then having all the data in the message is fine.

Summary

These were the key points I learned from working with the Azure Service Bus.

I hope they can be useful to you & your organization.  If you have any questions or comments, do not hesitate to hit the comments section!

Nuget WordPress REST API – Authentication

I use WordPress.com as my blog platform.  It hosts the WordPress CMS software and adds a few goodies.

I was curious about their API after noticing that my blog app (Windows Live Writer) tended to create duplicates of pictures, leaving lots of unused assets in my Media library.  This really is a personal pet peeve since I’m still at less than 5% of my asset quota after 5 years.

There happen to be two APIs on WordPress.com:  the old XML-RPC API, actually used by Windows Live Writer, and the new REST API.

The new API is what people would call a modern API:  its authentication is OAuth based, it is RESTful and has JSON payloads.

Surprisingly there didn’t seem to be any .NET client for it.  So I thought…  why not build one?

Enter the WordPress REST API Nuget package.  So far, I’ve implemented the authentication, a get-user and part of a search-post.

For the search-post, I took the not-so-easy path of implementing an IQueryable&lt;T&gt; adapter in order to expose the Post API as a Linq interface.  I’ll write about that, but as a heads-up:  not trivial, but it works and is convenient for the client.

I will release the source code soon, but for the moment you can definitely access the Nuget package.

You can try the client on a site I’m assembling at https://wordpress-client.azurewebsites.net/.  Warning:  I do not do web UI so the look-and-feel is non-existent ;)

Here I’ll give a quick how-to using the client.

Authentication

WordPress.com has the concept of an application.  If you’re steeped in claims-based authentication, this is what is typically referred to as a relying party.  It is also equivalent to an application in Azure Active Directory.

You set up applications at https://developer.wordpress.com/apps/.  The three key pieces of information you need in order to get a user to authorize your application to access WordPress.com are:

  1. Client ID:  provided by WordPress.com, the identifier of your application
  2. Client Secret:  also provided by WordPress.com, a secret it expects you to pass around
  3. Redirect URL:  provided by you, where WordPress will send the user back after consent is given

Here is the authorization flow:

image

  1. The user clicks on a ‘sign in’ link from your web site.
  2. Your web site redirects the user’s browser to a WordPress.com page, passing the client-ID of your application and the return-url you’ve configured.  The URL will be:  https://public-api.wordpress.com/oauth2/authorize?client_id=<your value>&redirect_uri=<your value>&response_type=code
  3. Assuming the user consents to your application using WordPress.com, the user’s browser is redirected to the Redirect URL you provided to WordPress.com.  In the query string, your application is given a code.  This code is temporary and unique to that transaction.
  4. Your application can now contact the WordPress.com API directly (without the browser) to complete the transaction.  You POST a request to https://public-api.wordpress.com/oauth2/token, passing the code, the client-ID and other arguments.
  5. The API returns you a token you can use for future requests.
  6. For any future request to the API, you pass the token in the HTTP request.

Now, this is all encapsulated in the WordPress REST API Nuget package.  You still need to do a bit of work to orchestrate calls.

The link to the authorization page you need to redirect the end-user to can be given by:

static string WordPressClient.GetUserAuthorizeUrl(string appClientID, string returnUrl)

You pass the client-ID of your application and its return-url, and the method returns the URL you need to redirect the user to (step 2).

Then on the return-url page, you need to take the code query string parameter and call

static Task<WordpressClient> WordPressClient.GetTokenAsync(string clientID, string clientSecret, string redirectURL, string code)

This method is async.  All methods interacting with the WordPress API are async.  The method returns an instance of the WordPressClient class.  This is the gateway class for all APIs.

That was steps 4 & 5 basically.
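Putting those two calls together, a simplified ASP.NET MVC controller (all names and routes here are hypothetical, just to show the orchestration) could look like this:

using System.Threading.Tasks;
using System.Web.Mvc;

public class WordPressAuthController : Controller
{
    private const string ClientId = "<your client ID>";
    private const string ClientSecret = "<your client secret>";
    private const string RedirectUrl = "https://yourapp.example.com/WordPressAuth/Callback";

    // Step 2:  send the user to the WordPress.com authorization page
    public ActionResult SignIn()
    {
        return Redirect(WordPressClient.GetUserAuthorizeUrl(ClientId, RedirectUrl));
    }

    // Steps 4 & 5:  exchange the code received on the redirect URL for a token
    public async Task<ActionResult> Callback(string code)
    {
        var client = await WordPressClient.GetTokenAsync(ClientId, ClientSecret, RedirectUrl, code);

        // 'client' can now call the API on behalf of the user
        return RedirectToAction("Index", "Home");
    }
}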

Rehydrating a WordPress Client between requests

That is all well and good until your user comes back.  You do not want them to authorize your application at every request.

The typical solution is to persist the token in the user’s cookies so that at each request you can recreate a WordPressClient object.

For that you can access the token information in

TokenInfo WordPressClient.Token { get; }

When you want to recreate a WordPressClient, simply use its constructor:

WordPressClient(TokenInfo token)
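As a rough sketch, and assuming TokenInfo round-trips through a JSON serializer (an assumption I haven’t verified), persisting and rehydrating could look like this in an ASP.NET application:

using System.Web;
using Newtonsoft.Json;

// After authentication:  persist the token in a cookie
var tokenJson = JsonConvert.SerializeObject(client.Token);
Response.Cookies.Add(new HttpCookie("wp-token", tokenJson) { HttpOnly = true, Secure = true });

// On a later request:  rebuild a client from the cookie
var cookie = Request.Cookies["wp-token"];
var rehydratedClient = new WordPressClient(JsonConvert.DeserializeObject<TokenInfo>(cookie.Value));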

Getting user information

Just as an example of how to use the API beyond authorization, let’s look at how to get information about the user.

Let’s say the variable client is a WordPressClient instance, then the following line of code

var user = await client.User.GetMeAsync();

gets you a bunch of information about your end-user’s profile on WordPress.com, such as their display name, the date the user joined the site, their email, etc.  This method wraps the API operation https://developer.wordpress.com/docs/api/1.1/get/me/.

Summary

This was a quick tour of this new WordPress REST API Nuget package I just created.  I’ll put it on Codeplex soon if you want to contribute.

Corporate Cultures

It is said that Netflix represents the new I.T. corporation well.


If you are interested in seeing what their corporate culture looks like, have a look at the slide deck they show to their job candidates.

It has all the flair of the typical Silicon Valley shop with their “we know better than those bunch of twits” attitude, but their critique of and response to “normal” corporations’ values is a good read.

 

But if you want something even more drastic, check out Valve’s.  Valve Corporation (video game distributor, e.g. Half-Life) implements a drastic departure from the normal corporation:  a flat organization.  Their employee manual has an even more pamphlet-like feel to it.

 

In the same vein, but from a Brazilian company not related to IT, a good watch is the following TED Talk video.  Semco CEO Ricardo Semler gives a very inspiring talk about how he deconstructed the “boarding school” aspects of his company by taking arrival times, desk locations, even salaries out of management’s hands, with apparently great success:  his company’s revenue grew many fold under his lead.

 

Enjoy!