SOA vs Mobile APIs

I recently read an article from Bill Appleton of Dream Factory with the provocative title SOA is not a Mobile Backend.

It raised quite a few good points that had been in the back of my mind for more than a year.

Basically, what is the difference between SOA and APIs?

To an extent, it is largely the domain of the buzzword department, but as you think about it, the difference is more profound.

SOA really is an Enterprise creature.  It's a system integration strategy (despite what SOA purists will tell you).  As Bill mentions in his article, SOA also typically comes with its heavy Enterprise artillery:  Enterprise Service Bus, XML message translation, Service Locator, etc.  But it also comes with a set of useful practices:  domain knowledge, reusable services, etc.

The API is an internet beast.  How do you interact with a service in the cloud?  You talk to its API.  APIs are typically simpler in terms of protocols:  HTTP, REST, JSON, simple messages, etc.  They are also messy:  is an API about manipulating a specific entity or offering a consistent set of functionalities?

To me, they spring from the same principles, i.e. a standard interface to exchange information and commands in a loosely coupled way between remote components.  SOA is the earlier, Enterprise result of those principles.  APIs are the later, internet / mobile result.

SOA was tried by some big enterprises, forged by committee with expensive consultants and executives trying to solve the next ten years' problems.

APIs were put forward by a myriad of small companies and consumed by even more entities.  They figured out the simplest way to expose and consume services quickly, and learned from each other.  In a few years a set of practices was observed and documented, and standards are even emerging.

Bill, in his article, contrasts the approaches in a way that reminds me of the old SOA debate of top-down vs bottom-up approaches:  do you discover your services by laying down your business processes and drilling down until you discover a need for services, or by looking at your applications, seeing which services they expose, and hoping that one day you can reuse them?

There is a lot of that in the issues Bill brings up around APIs.  As in SOA, if you just spawn new APIs on demand, you'll end up with a weird mosaic of overlapping concepts and functionalities.  I agree that practices developed for SOA can definitely help.  Service taxonomy, for instance, forces you to think about how your services will align and where their boundaries will be drawn before you start.

But for an organization, I believe it is nearly mandatory therapy to implement one or two APIs and run them in full operation before starting serious discussions around other SOA aspects.  Once you've tried it, you can have a much more informed discussion about what changes in a service and at which pace (while discussing versioning), what type of security rules make sense and a bunch of other aspects.

Otherwise you fall victim to the good old analysis paralysis and will host meeting after meeting to decide on something, because everyone has a different a priori perspective on it.

 

So my suggestion is:  yes, APIs are a little messier, but experimenting with them, even if you end up with a lot of rework, will bring much value to your organization.  So go and create simple APIs, expose them and operate them!

Azure SQL Data Warehouse

Documentation on Azure SQL Data Warehouse, the new managed data warehouse service on Azure, is quite thin.

The online documentation, as of today (24/07/2015), consists of 3 videos and a blog post.

Here is what I gathered.

One of the great characteristics of the offering is the separate Storage & Compute billing.

You can indeed run data crunches on a periodic basis (e.g. at month end) and pay only for those, while keeping your data up there at the competitive price of Azure Storage.

Compute is said to be 'elastic', although it isn't automatic.  You can change the amount of compute horsepower associated with your data warehouse instance without having to rebuild it, so it's done on the fly, very quickly.  Nevertheless, you need to do it through the portal (or PowerShell), so it isn't auto-elastic on demand.
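For the curious, here is a minimal sketch of what that scaling operation can look like from client code; the server name, credentials and the 'DW400' target below are made up, and the portal or PowerShell achieve exactly the same result.

```python
# A hedged sketch: scaling the compute (DWU) of an Azure SQL Data Warehouse
# instance from Python via pyodbc.  Server, credentials and database names are
# hypothetical; adjust to your environment.
import pyodbc

connection = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=master;"                  # the scaling statement is issued against master
    "UID=dwadmin;PWD=<password>",
    autocommit=True,                    # ALTER DATABASE cannot run inside a transaction
)

# Change the compute tier without rebuilding the warehouse or moving the data.
connection.execute("ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW400');")
connection.close()
```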

There is a nice integration with Azure Data Factory to import data into it.

The service is already in preview; you can access it on the new portal.

Azure SQL Data Warehouse looks very promising, although the available documentation is slim at this point in time.

The most detail I found was actually in the SQL Server Evolution video, which is a nice watch.

SQL Server 2016

Here’s a rundown of my favourite new features in SQL Server 2016, largely inspired by the SQL Server Evolution video.

Impact of Cloud-First on SQL Design

This is a really nice illustration of the consequences of Cloud-First for Microsoft products.

SQL has been basically flipped around as a product.  When SQL Azure was introduced years ago, it was a version of SQL Server running in the cloud.  Nowadays, SQL Azure drives the development of the SQL Server product.

Being cloud-first allows Microsoft to iterate much faster on different product features in preview mode, gather a ton of feedback thanks to the scale of Azure, and deploy globally very quickly once the development cycles are over.  That changes the entire dynamic of product development.

The nice thing is that it actually improves the SQL Server product:  when a new version of SQL Server comes in, e.g. 2016 right now, the features have been explored by a user base an order of magnitude greater than the beta-test audience of the old world.

In-memory Columnstore indexes

In-memory OLTP was introduced in SQL Server 2014.  SQL Server 2016 adds another twist:  Columnstore indexes on in-memory tables!

You can now have a high-throughput table that is fully in-memory and also optimized for analytics (columnstore indexes).

This unlocks scenarios such as real-time analytics (with no ETL to data warehouses).
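To make that concrete, here is a hedged sketch (not taken from the video) of what such a table could look like; the table, its columns and the connection string are made up, and the database is assumed to already have a memory-optimized filegroup.

```python
# A hedged sketch: an in-memory (memory-optimized) table that also carries a
# clustered columnstore index, created from Python via pyodbc.  The connection
# string, schema and table are hypothetical.
import pyodbc

ddl = """
CREATE TABLE dbo.Trades (
    TradeId   INT IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED,
    Symbol    NVARCHAR(10) NOT NULL,
    Price     DECIMAL(18, 4) NOT NULL,
    TradedAt  DATETIME2 NOT NULL,
    INDEX ccsi CLUSTERED COLUMNSTORE                -- analytics-friendly index...
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);  -- ...on an in-memory table
"""

connection = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Demo;Trusted_Connection=yes",
    autocommit=True,
)
connection.execute(ddl)
connection.close()
```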

Always Encrypted

Transparent Data Encryption (TDE, only recently added to Azure SQL) encrypts the data on disk.  This mainly addresses the physical compromise of data:  somebody steals your hard drive.

Not a bad start, as data center hard drives still get recycled, and once in a while you'll see headlines about data leaks caused by hard drives found with sensitive data on them.

Now, what Always Encrypted brings to the table is the ability to encrypt data in motion, i.e. while you are reading it…  and it is column-based, so you can very selectively choose what gets encrypted.

With this, a SQL Database administrator will only see rubbish in your encrypted columns.  Same thing for somebody eavesdropping on the wire.

The cryptographic keys aren't stored in SQL Database either, but on the client side (Azure Key Vault, anyone?), which means that even if your entire database gets stolen, the thief won't be able to read your encrypted data.

…it also means you'll have to be freaking careful about how you manage those keys, otherwise you'll end up with encrypted data nobody can read (did I mention Azure Key Vault?).

Polybase

Polybase was introduced in SQL Server Parallel Data Warehouse (PDW).

It extends the reach of T-SQL queries beyond SQL Server tables to unstructured data sources such as Hadoop.

This will now be part of SQL Server 2016.

Run ‘R’ models in SQL

Microsoft's recent acquisition of Revolution Analytics didn't take long to have an impact.

We will be able to run R analytics models right inside SQL Server 2016.  This brings the power of R data analysis so much closer to the data!

Advanced Analytics

Power BI is of course at the center of this, but there is also a complete revamp of Reporting Services.

Stretching SQL Server to Azure

A hybrid-cloud approach to SQL Azure:  put your cold data (typically historical) in cloud storage but keep your database on-premises.

I'm talking about data within the same table being both on-premises and in the cloud!

Quite easy to set up, this feature has the potential to be a really nice introduction to the cloud for many organizations.  The high-value scenario is dropping the storage cost of on-premises applications that have huge databases where most of the data is historical, i.e. cold (rarely accessed).

It is all configuration-based, hence requires no changes in the consuming applications.

Wrapping up

SQL Server 2016 proves that the SQL Server product line is alive and kicking, with very valuable features for modern scenarios, be it Big Data, Advanced Analytics or Hybrid Cloud computing.

You can try SQL Server 2016 using the VM template SQL Server 2016 CTP2 Evaluation on Windows Server 2012 R2 in the Marketplace as of this date (22/07/2015).

If you want more details, the best source so far is the SQL Server Evolution video, which is well done.

Azure ML – Simple Linear Regression

Now that we got the basics of Machine Learning out of the way, let’s look at Azure Machine Learning (Azure ML)!

In this blog, I will assume you know how to set up your workbench.

In general, there are quite a few great resources for Azure ML out there.

On the agenda:  I will take the sample set used in my previous post, perform a linear regression on it and validate that all I said about linear regression and Machine Learning is true.

This blog was done using Azure ML in mid-July 2015.  Azure ML is a product in evolution and the interface will certainly change in the future.

Create Data Source

First let’s get the data.

I will work with the same data set as in the previous articles, which can be found here.  It is about the relationship between height and weight in humans.  We will try to predict the weight given the height of different individuals.

For the purpose of this blog, having 200 data points is plenty and eases manipulation.  Let's cut and paste their table into an Excel spreadsheet and save it in CSV format.

Yes, believe it or not, Azure ML cannot load native Excel files!  So you need to save the data as CSV.
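If you prefer scripting the conversion, here is a hedged pandas sketch; the file names are hypothetical and the manual save-as-CSV from Excel works just as well.

```python
# A minimal sketch: convert the Excel file to the CSV format Azure ML can ingest.
# File names are hypothetical; pandas needs an Excel reader (e.g. openpyxl) installed.
import pandas as pd

data = pd.read_excel("height_weight.xlsx")        # the 200-row height / weight sample
data.to_csv("height_weight.csv", index=False)     # CSV, ready to upload as a data set
print(data.head())                                # quick sanity check of the columns
```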

We will create an Azure ML data set from it.  In the Azure ML workbench, select the big plus button at the bottom left of the screen, then select Data Set, then select From Local File.

image

Select the file you just saved from Excel and click OK.

The experiment

Click the big plus button again (at the bottom left of the screen), then select Experiment:

image

and select Blank Experiment:

image

The Workbench will present you with a sort of experiment template.  Don’t get too emotionally attached:  it will disappear once you drop the first shape in (which will be in 30 seconds).

Right off the bat, you can change the name of the experiment in order to make it easier to find.  Let’s call it Height-Weight.  Simply type that in the canvas:

image

You should see your data set under “My Data Set” on the left pane:

image

Your data set will have the name you gave it or, by default, the name of the CSV file you uploaded.

Let’s drag the data set onto the canvas.

image

In the "search experiment items" box, let's type 'project':

image

Then we can select Project Columns and drop it on the canvas:

image

We then have to link the two shapes.  In a rather counter-intuitive way, you have to start from the Project Columns shape towards the data set shape.

image

Why do I want to project columns?  Because I do not want to include the index column.  I actually did include it while preparing this blog entry; the index column was used by the regression and gave bizarre results.

Let's select the Project Columns shape (or module).  In the properties pane, on the right, you should be able to see a Launch column selector button.  Well, let's launch that selector.

image

We then simply select the two columns that have meaning:  height and weight.

image

Model

We are now going to train a model, so let's drop a Train Model module in there.  If you've followed so far, yes:  type "train model" in the module search box, select Train Model and drop it under Project Columns.

Our model will be a linear regression, so let's find that module too.  There are a few types of linear regression modules; today we'll use the one named "Linear Regression", found under "Regression".

Let’s link the module this way:

image

We need to tell the model what variable (column) we want to predict.  Let's select the Train Model module, which should allow us to launch the column selector (from the properties pane).

You should only have the choice between weight and height.  Choose weight.

You should notice that you do not see the index column.  That's because we basically projected it out.

Now you can run the whole thing:  simply click the Run button at the bottom of the screen.

It takes a little while.  You'll see a clock icon on your different modules, turning into a green check mark as each one gets run.

Result

Wow, we have our first linear regression.  What should we do with it?

Let’s plot its prediction against the data in Excel.  First, let’s find the computed parameters of the model.

Let's right-click on the bottom dot of the Train Model module and select View Results.

image

At the bottom of the screen you should have those values:

image

Remember the formula f(x) = m*x + b?  Well, the bias is b, while the other one (oddly named, I should add) is the slope m.

In Excel, we can punch in the following formula:  =-105.377+3.42311*B2 and copy it for every row.  Here I assume column A is the index, B is "Height(Inches)", C is "Weight(Pounds)" and D is the column where you'll enter that formula.  You can title that column "Computed".
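If you would rather check the formula outside of Excel, here is a small Python equivalent using the bias and slope values reported above (the sample heights are just illustrative numbers).

```python
# Reproduce the Excel formula =-105.377+3.42311*B2 in plain Python.
# slope and bias are the values reported by the Train Model output above.
slope, bias = 3.42311, -105.377

def predict_weight(height_inches):
    """f(x) = m*x + b with the parameters learned by Azure ML."""
    return slope * height_inches + bias

for height in (60.0, 65.0, 70.0):                 # illustrative heights in inches
    print(f"{height:.1f} in -> {predict_weight(height):.1f} lb (predicted)")
```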

Your spreadsheet should look like this:

image

You can see that the computed value isn't quite the value of the weight but is in the same range.  If you plot all of that, you should get something like this:

image

The blue dots are the data while the orange ones are the predictions.

You can see that the line is passing through the cloud of data, predicting the data as well as a line can do.
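As a side note, the same fit can be reproduced locally; here is a hedged scikit-learn sketch that uses the CSV we uploaded earlier (the file name and column headers are the ones assumed in this post).

```python
# A hedged sketch: recompute the linear regression locally and compare the
# parameters with the ones reported by the Train Model module.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("height_weight.csv")                       # hypothetical file name
model = LinearRegression().fit(data[["Height(Inches)"]], data["Weight(Pounds)"])
print(f"slope m = {model.coef_[0]:.5f}, bias b = {model.intercept_:.3f}")
# Both values should land close to the Train Model output shown above.
```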

Exporting predictions

Maybe you found it a bit funny that I extracted the model parameters and then entered an equation in Excel.  We could actually ask Azure ML to compute the predictions for the data.

For that, let’s drop a Score Model module on the canvas and link it this way:

image

Basically we are using the model on the same data that was used to train it (output from the Project Columns module).

Let’s run the experiment one more time (run button at the bottom of the screen).  We can then right click on the bottom dot of the Score Model module and select View Results.

image

You can then compare the values in the Scored Labels column to the ones in the Computed column in Excel and see that they are the same.

image

We could have exported the results using a writer module, but that requires quite a bit of configuration.

Summary

I assumed very little knowledge of the tool, so this blog post was a bit verbose and heavy on images.

My main goal was to show you a concrete representation of the concepts we discussed before:

  • Independent / Dependent variable in a data set
  • Predictive model
  • Linear Regression
  • Optimal parameters through training

The cost function was implicit here as an option in the Linear Regression module.  You can see that by clicking on the module.

image

Another good introduction to AzureML is the Microsoft Virtual Academy training course on Machine Learning.

Team IQ


I'm still catching up on the articles I read months ago and wanted to share them with you!

This one was published in the New York Times and is about the intelligence of a team, or team IQ.  It asks why some teams are smarter than others.

It starts by stating two fundamental truths most of us know:

  1. In the workplace (and elsewhere) today, most decisions are taken by groups of people rather than individuals.
  2. A lot of teams are dysfunctional, or at least feel sub-optimal to most of their members.

The authors studied a specific attribute of a team:  its intelligence, its cognitive ability.  They published studies (from the famous Carnegie Mellon University and M.I.T.) where they compared different teams and tried to see what, in the team members, influences a team's collective intelligence.

Surprisingly, it isn't the sum of the members' I.Q.!

Instead, three characteristics had the highest correlation factor:

  1. Team members contributed more equally to the team discussion (as opposed to having one or two strong voices and the rest following)
  2. Team members scored high on the "Reading the Mind in the Eyes" test (more on that below)
  3. The team had more women

So the "Reading the Mind in the Eyes" test is about the ability to read complex emotions from the eyes of individuals.  It is also called social intelligence.  You can take the test here!

The interesting bit about the women element is that it isn't about diversity (i.e. having a well-balanced men-women team) but about having more women on board!  There is a strong correlation between the 2nd and 3rd characteristics because women apparently score, on average, higher on the "Reading the Mind in the Eyes" test.  Also, as another revelation of the study, emotion reading is as important for online / virtual teams as for face-to-face ones.  So it isn't just about reading faces but about reading and understanding emotions in general.

As you can tell by my blog site's title, I work in I.T., and this is where the scientific aspect of the article (i.e. the study involving multiple teams) was key.  Because…  I seldom experience teams with women!  You see, I.T. has about the same concentration of women per square meter as a tavern ;)

So there you go.  If you want your team to be cleverer, add women to it.  And for those who are frustrated about the space women take in the workplace, well, keep being frustrated, because they really bring something to the table and therefore aren't going anywhere ;)

Machine Learning – An Introduction – Part 2

In a past blog entry I gave an overview of what Machine Learning is.  I showed a simple linear regression example.  The goal really was to explain to newcomers to the field what Machine Learning is, what type of problems it tries to solve and what the general approach is.

Of course, I used an extremely simple example in order not to add noise to that main goal.


Machine Learning is about extracting information from a sample data set in order to select the optimal model (best set of parameters) fitting the sample set.


In this blog, I’ll give you a more complete picture by looking at different aspects of more realistic Machine Learning scenarios.

Different learning problems

We looked at one type of learning problem:  regression, i.e. predicting continuous values of dependent variables given values of independent variables.

There are at least two other popular learning problems:  classification & clustering.

A classification problem is a little like regression, where you are trying to model f with y = f(x), but instead of y being a continuous variable, y is discrete, i.e. it takes a finite number of values, e.g. {big, small}, {unknown, male, female}, etc.

An example of a classification problem from bioinformatics would be to take genomic scans (e.g. DNA microarray results) of a patient and predict whether they are prone to developing cancer (true) or not (false), based on a sample set of patients.  The dependent variable here would take a Boolean value:  {true, false}.

A clustering problem, on the other hand, consists of predicting a class without having classes in the sample set.  Basically, the algorithm tries to segment the data automatically, as opposed to learning the segmentation as in classification problems.

An example of clustering, in marketing, would be to take customer data (e.g. demographics & buying habits) and ask the system to segment it into 3 groups.

There are, of course, many more learning problems.  For instance, time series analysis, where we do not look at static data but at how it varies over time.

Different model classes

We looked at one class of models:  linear regression.

It is a special type of regression.  It is one of the simplest algorithms but also, often, the most useful.

Starting from the linear regression model, we could go non-linear, e.g. to polynomial, so instead of having f(x) = m*x + b, we would have

f(x) = a_0 + a_1*x + a_2*x^2 + a_3*x^3 + … + a_n*x^n

for a given degree n.  The bigger the degree, the more the function can curve to fit the sample set, as shown in the following examples:

image
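Here is a minimal numeric sketch of that effect, on made-up data:  as the degree grows, the polynomial curves more and the training error shrinks.

```python
# Fit polynomial regressions of increasing degree to a small, hypothetical
# sample set and watch the training error drop as the degree increases.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.3])      # roughly linear data with noise

for degree in (1, 3, 5):
    coefficients = np.polyfit(x, y, degree)        # least-squares fit of that degree
    model = np.poly1d(coefficients)                # f(x) = a_0 + a_1*x + ... + a_n*x^n
    training_error = np.sum((y - model(x)) ** 2)
    print(f"degree {degree}: training error = {training_error:.4f}")
```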

Another popular class of models is neural networks.  They too have a variable number of parameters and come in varying topologies (single layer, multi-layer, deep, with feedback loops, etc.).

Classification problems give rise to different models.  A linear separator model actually borrows a lot of the concepts of linear regression:  imagine that a line is used not to predict points but to separate them into two categories.  The same could be said of a lot of regression models.

The most popular model (and the simplest) for clustering is k-means.
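To tie this back to the marketing example above, here is a hedged k-means sketch with scikit-learn; the customer data is entirely made up.

```python
# Segment hypothetical customers into 3 groups with k-means (no classes given).
import numpy as np
from sklearn.cluster import KMeans

# Made-up customer data: [age, purchases per year]
customers = np.array([
    [22, 4], [25, 6], [27, 5],       # younger, few purchases
    [41, 25], [44, 22], [39, 27],    # middle-aged, frequent buyers
    [63, 10], [66, 12], [70, 9],     # older, occasional buyers
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # the segment assigned to each customer
print(kmeans.cluster_centers_)   # the centroid of each segment
```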

Dimensions

Most real-life problems are not about mapping one independent variable onto one dependent one like we did in the previous post.  There are multiple dimensions, both as independent and dependent (predicted) variables.


Multiple dimensions bring two complexities to the table:  visualization & the curse of dimensionality.

Having more than 2 or 3 dimensions makes visualization non-trivial.  Actually, some data analysis / simple machine learning methods (e.g. linear regression) can be useful for a first analysis of the data before choosing what type of ML model to apply to it.

Most models are geared to work in multiple dimensions.  The general linear model, for instance, works in multiple dimensions.

When the number of dimensions is high, a well-known set of phenomena, referred to as the curse of dimensionality, occurs.  Basically, our intuition from low dimensions (e.g. 2, 3) fails us in high dimensions.

To give only one example of such phenomena, if you take a hyper-sphere of dimension d, the distribution of points within the sphere changes drastically as d increases.  With d=2, a circle, if you pick a point at random within the circle, the distance between the point and the center of the circle will have a very roughly uniform distribution.  When d increases, the volume of the hyper-sphere concentrates closer to the surface, making the distribution extremely high for distances close to the radius of the hyper-sphere and close to zero towards the middle.  Can you picture that in your head?  I can't, and that's why high dimensions are tricky to work with.
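If you do not trust your intuition (you shouldn't), here is a small simulation of that phenomenon:  sample points uniformly in a d-dimensional ball and look at how far from the center they typically land.

```python
# For a point sampled uniformly in a d-dimensional unit ball, the distance to
# the center is distributed as U**(1/d) with U uniform on [0, 1]; simulate it
# and watch the mass concentrate near the surface as d grows.
import numpy as np

def distances_in_unit_ball(dimension, n_points=100_000, seed=0):
    rng = np.random.default_rng(seed)
    return rng.random(n_points) ** (1.0 / dimension)

for d in (2, 10, 100):
    r = distances_in_unit_ball(d)
    print(f"d={d:3d}: median distance to center = {np.median(r):.3f}")
```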

A lot of techniques fail to work in high dimensions because of that.

Testing set

An important activity someone wants to do with a model is to validate how good it really is.

Since we select the optimal model for a given sample set, how good is that optimum?  Put otherwise, how can we measure the quality of the model?

A typical way to do that is to split the available sample set into two sets:

  • A learning data set
  • A testing data set

The learning data set is used to learn / optimize the model’s parameters while the testing data set is used to validate the output model.

The validation is often done by computing the cost function over the testing set.  This is a good way to compare two model classes, e.g. linear regression vs neural network, to see how they each perform.  Using this kind of validation, one can actually train different classes of models and select the best one overall.  Testing also allows one to avoid overfitting, which is our next topic.
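Here is a hedged sketch of that workflow on made-up data:  split the sample set, train two model classes on the learning part and compare their cost on the testing part.

```python
# Compare two model classes (linear regression vs a small neural network) by
# their mean squared cost on a held-out testing set.  The data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((200, 1))                                  # one independent variable
y = 2.0 * X.ravel() + 1.0 + rng.normal(0.0, 0.1, 200)     # noisy linear relationship

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "neural network": make_pipeline(StandardScaler(),
                                    MLPRegressor(hidden_layer_sizes=(10,),
                                                 max_iter=20000, random_state=0)),
}
for name, model in models.items():
    model.fit(X_train, y_train)                           # learn on the learning set
    cost = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: cost on the testing set = {cost:.4f}")
```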

Overfitting

You might have noticed I've talked about different models having varying numbers of parameters, and that more parameters allow a better fit to the training data.  For instance, polynomial regression:  a linear regression has two parameters (bias and slope), a cubic regression has 4, etc.

You might then wonder:  why not put the maximum number of parameters in there and let it do its thing?  Yes, there is a catch.  The catch is that if you do that, you will overfit your training data set and your model will be excellent at predicting the data it has seen and terrible at any generalization.

As usual, it is easier to explain by showing an example.  Let’s take my example from the previous article.  To make the example obvious, I’ll exaggerate.  I’ll take only two points from the training set (circled in green) and do a linear regression on those.

You see how the top line fits those two points perfectly?  The distance between the points and the line is zero.  The cost function is zero.  Perfect optimization.  But you also see how poorly it predicts the other points compared to the bottom line?  The top line overfits the data.

If I had split the data between a training and a test set as discussed in the previous section, I would have been able to measure that poor generalization.

This is actually one way to fight overfitting:  test the learning algorithm and select one with better generalization capacity.  For instance, I could decide to go with polynomial regression but select the polynomial degree using the testing set.

Again, the same thing happens with a polynomial regression of high degree (say n > 30) fitting a few points perfectly but having poor generalization capacity.
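Here is a small numeric version of that illustration:  a degree-8 polynomial has enough parameters to pass exactly through 9 noisy training points, yet it generalizes worse than a plain line on points it has never seen (all numbers are made up).

```python
# Overfitting in a few lines: compare a line and a degree-8 polynomial on
# synthetic noisy linear data, measuring the error on held-out points.
import numpy as np

rng = np.random.default_rng(42)
x_train = np.linspace(0.0, 1.0, 9)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.2, x_train.size)
x_test = (x_train[:-1] + x_train[1:]) / 2.0               # points in between the training ones
y_test = 2.0 * x_test + 1.0 + rng.normal(0.0, 0.2, x_test.size)

for degree in (1, 8):                 # degree 8 has 9 coefficients, one per training point
    model = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_error = np.mean((y_train - model(x_train)) ** 2)
    test_error = np.mean((y_test - model(x_test)) ** 2)
    print(f"degree {degree}: training error {train_error:.4f}, testing error {test_error:.4f}")
```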

There are two sides to overfitting:  a large number of parameters and a low number of points in the sample set.  One can therefore either lower the number of parameters or increase the data set size, which is the next topic.

It is worth noting that a large number of parameters used with a data set containing an enormous number of points won't overfit:  the parameters will adjust and be driven towards zero.  A line is a special kind of cubic, for instance.

Data set size

As hinted in the previous topic, increasing the number of points in a training data set increases the ability of the trained model to generalize.

Data set sizes have exploded in the last couple of years (yes, basically the Internet, but also cheap storage) and that is one of the main reasons for the Machine Learning renaissance.

This is why combining big data with Machine Learning is key to unleashing its potential.

Iteration / Approximation

With linear regression, we can compute an exact solution.  Most models do not allow that.  Typically, we iteratively approximate a solution.  Furthermore, for any non-trivial learning problem, there are multiple local optima, which makes finding the global optimum all the more difficult.

This would require an entire blog post just to scratch the surface, but suffice it to say that you'll see the notion of steps or iterations a lot in Machine Learning models.  It has to do with the approximation methods used.

For instance, one of the ways to avoid overfitting is to stop the approximation process before it reaches a local optimum.  It can be shown that this improves the generalization capacity of the model.
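As a minimal illustration of such an iterative method, here is a plain gradient descent on the linear regression cost (made-up data); a real library would of course use something more refined.

```python
# Gradient descent on the mean squared cost of a linear model, step by step,
# instead of the exact closed-form solution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 100)
y = 3.0 * x + 4.0 + rng.normal(0.0, 1.0, 100)      # noisy line y = 3x + 4

m, b = 0.0, 0.0                                    # start from an arbitrary model
learning_rate = 0.01

for step in range(2001):
    error = (m * x + b) - y                        # residuals of the current model
    m -= learning_rate * 2.0 * np.mean(error * x)  # gradient of the cost w.r.t. m
    b -= learning_rate * 2.0 * np.mean(error)      # gradient of the cost w.r.t. b
    if step % 500 == 0:
        print(f"step {step}: cost = {np.mean(error ** 2):.3f}")

print(f"approximated slope m = {m:.3f}, bias b = {b:.3f}")
```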

Data Quality

I kept this one for the end, but believe me, it isn’t the least.  It plagues every Machine Learning problem.

If there is one field of science where the saying Rubbish In, Rubbish Out applies, it is in Machine Learning.

Machine Learning models are extremely sensitive to the quality of data.

Real data sets you take from databases are rubbish.  They are full of missing data, data has been captured under different (undocumented) conditions and sometimes the semantics of the data change over time.

This is why a lot of the leg work in Machine Learning has little to do with Machine Learning and a lot to do with data cleansing.

Let me give you a real example I stumbled upon lately.  We were discussing an Internet of Things (IoT) problem with a customer.  We were planning to install vibration / temperature sensors on industrial machines to monitor their manufacturing process, and we were thinking about how we could eventually leverage the collected data with Machine Learning.

The key point I made in that conversation was that this would need to be designed up front into the IoT implementation.  Otherwise, I could guarantee that different sensors wouldn't be calibrated the same way and that Machine Learning models would just get lost in the fuzzy data and make poor predictions as a consequence.

The most basic step of data cleansing is to normalize the data (center it on its average and scale it so its standard deviation equals one), but it goes beyond that.
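For the normalization part, here is what that looks like in a few lines of Python on made-up sensor readings.

```python
# Normalize each column: subtract its average and divide by its standard deviation.
import numpy as np

# Hypothetical raw readings: temperature (Celsius) and vibration (mm/s)
readings = np.array([
    [21.5, 0.12],
    [22.1, 0.34],
    [35.8, 0.90],
    [19.9, 0.08],
])

normalized = (readings - readings.mean(axis=0)) / readings.std(axis=0)
print(normalized.mean(axis=0))   # ~0 for each column
print(normalized.std(axis=0))    # 1 for each column
```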

This is why I see Machine Learning as yet another step in the B.I. pipeline of a company:  first you have your transactional data, then you aggregate it and control its quality, and then you can analyse it and make predictions.

Summary

In the last two blog posts, I’ve tried to give you a taste of what Machine Learning is.  I hope it wasn’t too much of a torrent of information to take in.

It is important to understand these kinds of basics before jumping into powerful tools such as Azure Machine Learning.

ML is a relatively mature field, some 30-50 years old (depending on where you draw the line between Machine Learning, Data Analysis and Statistics), so there is a lot out there.  It is also a young field where progress is made every day.

What makes it exciting today is the capacity we have to gather data and process it so quickly!

If you would like a complementary read, take a look at Azure Machine Learning for Engineers.  They approach machine learning from a different angle and tie it immediately to concrete usage (with Azure ML).

If you have any questions, do not hesitate to punch the comment section below!