Machine Learning – An introduction – Part 1

As I mentioned before, I did specialize (through graduate studies) in Machine Learning, only to drop the field after a few years of trying it in the marketplace.  I felt the field wasn’t ready for prime industrial applications.

Years have passed, the field has matured and now is an exciting time to be working in Machine Learning!  The possibilities have far outgrown the labs where they were born.

Yet it is still quite a complex field, sitting at the intersection of statistics, data analysis and, if you want to do it right, Big Data.

Before diving into Azure Machine Learning, I wanted to first give an overview of what Machine Learning is.  My favourite 10-minute story is an example that is simple enough to grasp without prior ML knowledge:  few dimensions, few data points, simple ML algorithm.

In ML parlance, I’m going to give a linear regression example, but you do not need to know what that is to understand it.

The example

Machine learning is all about building models.  What is a model?  A model is a simplified version of reality:  it “embodies a set of assumptions concerning the generation of the observed data, and similar data from a larger population” (wikipedia).

In my example, we are going to predict the weight of a person given their height.

We are going to build a model, and that model will be able to predict the weight you should have if you are 6 feet tall.

We already made quite a few assumptions.  We assumed the weight of a person depends on their height.  Written in mathematics:

weight = f(height)

That might remind you of a typical formulation

y = f(x)

But we’ll go further and assume that the weight has a linear relationship (y = m*x + b) with the height:

weight = m*height + b
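
To make this concrete, here is the model written as a tiny Python function (just a sketch; at this point we don’t know yet what m and b should be):

```python
def predict_weight(height, m, b):
    """Linear model: predict a weight from a height, given the parameters m and b."""
    return m * height + b
```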

Now of course, this is a very simplified model, a very naïve one.  There are many reasons why you might think this model is incomplete.  First, it doesn’t include a lot of variables, for instance, the age, the gender, nationality, whatever.  It’s alright, it’s a model.  Our goal is to make the best out of it.

Let’s look at some sample data I found on the web.  I’ve entered the data in my #1 analysis tool, Excel, and plotted it:

[Chart:  scatter plot of the height / weight sample data]

We can see a nice cloud of data and we could guess there is sort of a linear relationship between the height and the weight.  That is, there is a line sort of carrying this cloud.  Now the question I’ll ask you is:  where would you put the line?

[Chart:  the same data with four hand-drawn candidate lines]

I’ve hand drawn 4 lines.  The two bottom ones aren’t very convincing, but what about the two top ones?  Which one would you choose and why?  What criteria do you use?

The mathematical problem

y = m*x + b:  given an x, we can compute a y.  We have an independent variable, x, the height, and a dependent variable, y, the weight.

We also have parameters:  the slope, m, and the intercept (where the line crosses the y-axis), b.

The model is described by its parameters.  Guessing what the line should be is guessing its slope and intercept.

We also have sample data:  a data set of sample x’s & y’s.  We want to use that sample to deduce the parameters:

parameters = F(sample data)

This is the learning in Machine Learning.  We are showing examples of correct predictions to an algorithm (a machine) and we want the algorithm to figure out the model that best predicts them all, with the minimum number of errors, while also being able to predict new data it has never seen.

Simple enough?  What is the recipe?

Basically, we are going to consider many models (many values of m & b) and compare them using a cost function to select the best one.

A cost function of a given model on a data set is the sum of the cost function applied to each point in the data set:

Cost(model, data set) = Sum of Cost(model, point) over all points

In our example, an intuitive cost function would be the distance of a sample point to the line.  After all, we want the line to be “in the thick” of the cloud; ideally (impossible here) the points would all lie on the line.  The distances are represented by the green lines on the following graph (I’ve only plotted the first 3 data points for clarity).

[Chart:  vertical distances from the first three data points to a candidate line]

To make a story short, a more tractable cost function is the square of the distance measured on the y-axis:

Cost(model, {x, y}) = (predicted(x) – y)^2

If you are curious why, well, squares are easier to tackle mathematically than absolute values while yielding the same result in optimization problems.
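
Expressed in code, a minimal sketch of this cost function could look like the following (the sample points are invented for illustration; they are not the data set from the chart):

```python
def point_cost(m, b, x, y):
    """Squared vertical distance between the prediction m*x + b and the observed y."""
    predicted = m * x + b
    return (predicted - y) ** 2

def total_cost(m, b, samples):
    """Sum of the per-point cost over the whole data set."""
    return sum(point_cost(m, b, x, y) for x, y in samples)

# Hypothetical sample points (height in inches, weight in pounds), for illustration only.
samples = [(65, 150), (68, 160), (71, 175), (74, 190)]
print(total_cost(1.0, 80.0, samples))  # cost of one candidate model: m = 1.0, b = 80.0
```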

Putting it all together, the machine learning problem is:

Find m & b that minimize the sum of (m*x + b – y)^2 over all {x, y} in the data set

I formulated the problem in terms of our example but, in general, a machine learning problem is an optimization problem:  minimizing or maximizing some function over the sample set.

It happens that the problem as stated here can be solved analytically with linear algebra, i.e. we can find the exact solution without approximation.

I won’t give the solution since the details of the solution aren’t the point of the article.
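
For readers who want to peek anyway, here is a minimal sketch of that analytic solution in Python, using the standard least-squares formulas (the sample points are again made up for illustration):

```python
def fit_line(samples):
    """Closed-form least-squares fit: returns the (m, b) minimizing the sum of squared errors."""
    n = len(samples)
    sum_x = sum(x for x, _ in samples)
    sum_y = sum(y for _, y in samples)
    sum_xy = sum(x * y for x, y in samples)
    sum_xx = sum(x * x for x, _ in samples)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

# Hypothetical sample points (height in inches, weight in pounds).
samples = [(65, 150), (68, 160), (71, 175), (74, 190)]
m, b = fit_line(samples)
print(f"weight = {m:.2f} * height + {b:.2f}")
print(f"predicted weight at 72 inches: {m * 72 + b:.1f}")
```

The same (m, b) would come out of any decent statistics package; the point is simply that, for this class of models, the optimization has an exact answer.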

Summary

Let’s recapitulate what we did:

  1. We chose a prediction we wanted to make:  predict the weight of a person
  2. We chose the independent variable, only one, the height, and the dependent variable, only one, the weight
  3. We found a sample set to learn from
  4. We posited a class of models:  linear regressions with slope and intercept as parameters
  5. We chose a cost function:  the square of the difference between sample values and predictions
  6. We optimized (minimized in this case) the sum of the cost function to find the optimal parameters

We end up with the optimal values for m and b.  We therefore have our model, f(x)=m*x+b, and we can make predictions:  for any x we can predict f(x), i.e. for any height we can predict the weight.

We used examples to infer rules.  This is Machine Learning in a nutshell.  We let the Machine extract information from a sample set as opposed to trying to understand the field (in this case biology I suppose) and trying to derive the rules from that understanding.

[Diagram:  Machine Learning overview]

I hope this gave you an overview of what Machine Learning is, what type of problems it is aiming at solving and how those problems are solved.

In the next entry, I’ll try to give you an idea of what more realistic Machine Learning problems look like by adding different elements to the example we have here (e.g. number of variables, complexity of the model, splitting sample data into training and test sets, etc.).

Azure DocumentDB – Performance Tips

Azure DocumentDB has been released for a little while now.  Once you get past the usual steps of how to connect and do a few hello worlds, you will want to reach for more in-depth literature.  Sooner or later, performance will be on your mind when you want to make architecture decisions on a solution leveraging Azure DocumentDB.

Stephen Baron, Program Manager on Azure DocumentDB, has published a two-part performance tips article (part 1 & part 2).

The tips given there are quite useful and do not require you to rewrite all your client code.  They cover:

  • Network Optimization
  • How to better use the SDK
  • Indexing
  • Query optimization
  • Consistency settings

It is very well written, straightforward and therefore useful.  It is my reference so far.

Azure Data Factory Editor (ADF Editor)

Azure Data Factory is still in preview but obviously has a committed team behind it.


When I looked at the service when the preview was made available last November, the first thing that struck me was the lack of an editor, of a design surface.  Instead, you had to configure your factory entirely in JSON.  You could visualize the resulting factory in a graphical format, but the entry mode was JSON.

Now don’t get me wrong.  I love JSON as much as the next guy, but it’s not exactly intuitive when you do not know a service, and as the service was nascent, the documentation was pretty slim.

Obviously Microsoft was aware of that since, a few months later, they released a light-weight web editor.  It is a stepping stone in the right direction.  You still have to type a lot of JSON, but the tool provides templates and offers you some structure to package the different objects you use.

It isn’t a real design surface yet, but it’s getting there.

I actually find this approach quite good.  Get the back-end out there first, get the early adopters to toy with it, provide feedback, improve the back-end, iterate.  This way, by the time the service team develops a real designer, the back-end will be stable.  In the meantime the service team can be quite nimble with the back-end, avoiding decisions such as “let’s keep that not-so-good feature because it has too many ramifications in the editor to change at this point”.

I still haven’t played much with ADF given I didn’t need it.  Or actually, I could have used it on a project to process nightly data, but it wasn’t straightforward given the samples at launch time, and the project was too high visibility and had too many political problems to add an R&D edge to it.  Since then, I have been looking at the service with interest and can’t wait for tooling to democratize its usage!

Azure Key Vault – Pricing

Azure Key Vault is an Azure packaged service allowing you to encrypt keys and small secrets (e.g. passwords, SAS) and manage them in a secure fashion.  Azure Key Vault actually allows you to store cryptographic keys and do operations with them (e.g. encrypt data) without revealing the key, which is pretty cool.  Check it out.

Now a common question when new services come along is:  how much does it cost?  We all remember looking at API Management and saying:  I’m gonna put that in front of all my APIs, until we reached the price tag and back-pedalled!

With Key Vault, things are pretty cheap.  Let’s look at the pricing.

Unless you go with HSM protection (in which case you probably work for a financial institution and the pricing likely isn’t going to be the show stopper), Azure charges you by operation:

$0.0159 / 10 000 operations

Well, that’s pretty cheap unless…  you call the vault right before every single secret usage.  Say you have an app that generates a temporary SAS for each image it displays and you access the vault to get your hands on the Storage Account primary key every single time, well…  if your site is busy, you’re gonna end up paying for those operations.
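
To put numbers on that scenario, here is a rough back-of-the-envelope calculation (the traffic figures are invented for illustration):

```python
price_per_10k_ops = 0.0159                 # USD per 10,000 operations (standard, non-HSM tier)
images_per_page = 20                       # hypothetical
page_views_per_month = 10_000_000          # hypothetical busy site
vault_ops = images_per_page * page_views_per_month   # one vault call per image displayed
monthly_cost = vault_ops / 10_000 * price_per_10k_ops
print(f"{vault_ops:,} vault operations -> ${monthly_cost:,.2f} per month")
```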

But if you do that, you’re gonna slow down your site anyway, since you make a REST API call over HTTPS for every image you want to display, a bad idea in the first place.

 

My recommendation:  cache the recovered secrets for a short duration of time, say 5 minutes.  This way you won’t be penalized in terms of performance or pricing.
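
A minimal sketch of that caching, where fetch_from_vault stands in for whatever call you use to hit the vault’s REST API (it is not an actual SDK method):

```python
import time

_cache = {}              # secret name -> (value, expiry timestamp)
CACHE_SECONDS = 5 * 60   # keep secrets for 5 minutes, as suggested above

def get_secret(name, fetch_from_vault):
    """Return a secret, hitting the vault only when the cached copy has expired."""
    value, expiry = _cache.get(name, (None, 0.0))
    if time.time() < expiry:
        return value
    value = fetch_from_vault(name)                       # the actual (and billed) vault call
    _cache[name] = (value, time.time() + CACHE_SECONDS)
    return value
```

In a real application you would also want to handle concurrent refreshes and vault failures, but the idea stays the same.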

Azure Key Vault – Step by Step

Azure Key Vault is an Azure packaged service allowing you to encrypt keys and small secrets (e.g. passwords, SAS) and manage them in a secure fashion.  Azure Key Vault actually allows you to store cryptographic keys and do operations with them (e.g. encrypt data) without revealing the key, which is pretty cool.  Check it out.

A typical problem with those new services is the lack of documentation.  Well, no more for the Key Vault, thanks to Dan Plastina’s step-by-step guide on Key Vault.  It’s succinct, straight to the point and well written.

The guide’s backbone is the vault’s lifecycle:

[Diagram:  the Key Vault lifecycle]

Now this basically allows you to go to town with the vault.  It’s a very clean workflow that enables many scenarios.

The typical ones I would see are:

  • Secret owner creates a bunch of secrets for SAS (e.g. storage accounts, Service Bus) and allows some applications to have access to them
  • Applications access those secrets via the REST API
  • Those secrets are refreshed directly in the vault by the secret owner
  • Secrets can be refreshed on schedule by a web job
    • e.g. using a storage account key, itself another secret refreshed manually (or by another process), the web job recreates a SAS valid for 40 days every 30 days (a sketch follows after this list)
  • Auditors can check that access conforms to the established design
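
A minimal sketch of that rotation web job, where create_sas and store_secret_in_vault are hypothetical stand-ins for the actual storage and vault calls:

```python
from datetime import datetime, timedelta

SAS_VALIDITY = timedelta(days=40)   # the token outlives the 30-day rotation period by a margin

def rotate_storage_sas(storage_account_key, create_sas, store_secret_in_vault):
    """Meant to run on a schedule (e.g. every 30 days): mint a fresh SAS and push it to the vault."""
    expiry = datetime.utcnow() + SAS_VALIDITY
    sas = create_sas(storage_account_key, expiry)        # hypothetical helper: builds the SAS
    store_secret_in_vault("storage-sas", sas)            # hypothetical helper: updates the secret
```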

This is so much cleaner than what I see today in the field, where SAS are created, put in web.config, forgotten there until they expire, shared between developers who troubleshoot problems in production, etc.  Because of the work involved in cycling the SAS, those SAS are usually created with multi-year validity, so if they get compromised, well, you get the picture.

The nice thing here is that the vault is doing more than protecting secrets:  it allows you to manage them centrally.  For me that is half the value, especially if you have an application portfolio bigger than 2 apps sharing some secrets.  It gives you visibility into which secrets are used by whom and allows you to manage them.  You do not need to have SAS that last for 5 years anymore, since you can cycle them centrally.

Azure Key Vault

Has somebody been peeking at my X-mas list?

Indeed, one of the weaknesses of current Azure PaaS solutions I pointed out last year was that, on non-trivial solutions, you end up with plenty of secrets (e.g. user-name / password, SAS, account keys, etc.) stored insecurely in your web.config (or similar store).

I was suggesting, as a solution, to create a Secret Gateway between your application and a secret vault.

Essentially, Azure Key Vault fulfils the secret vault part and a part of the Secret Gateway too.

Azure Key Vault, a new Azure service currently (as of early February 2015) in preview mode, allows you to store keys and other secrets in a vault.

One of the interesting features of Azure Key Vault is that, as a consumer, you authenticate as an Azure Active Directory (AAD) application to access the vault and are given authorization as that application. You can therefore easily foresee scenarios where the only secret stored in your configuration is your AAD application credentials.

The vault also allows you to perform some cryptographic operations on your behalf, e.g. encrypting data using a key stored in the vault. This enables scenarios where the consuming application never knows the encryption keys. This is why I say that Azure Key Vault performs some functions I described for the Secret Gateway.

I see many advantages to using Azure Key Vault. Here are the ones that come off the top of my head:

  • Limit the amount of secrets stored in your application configuration file
  • Centralize the management of secrets: if a key is compromised and you want to change it, there is no need to chase the config files storing it; simply change it in one place, in the vault.
  • Put secrets in the right place: what is unique to your application? The application’s own identity, i.e. its AAD application credentials. That goes in your app config file; everything else goes in the vault.
  • Audit secret access
  • Easy to revoke access to secrets
  • Etc.

I think that, to be airtight, the Secret Gateway would still be interesting, i.e. an agent that authenticates on your behalf and returns you only a token. This way, if your application is compromised, only temporary tokens are leaked, not the master keys.

But with Azure Key Vault, even if you do not have a Secret Gateway, if your master keys are compromised you can centrally rotate them (i.e. change them) without touching N applications.

I’m looking forward to seeing where this service is going to grow and will certainly consider it in future PaaS architectures.