Azure ML – Overfitting with Neural Networks

In a past post, I discussed the concept of overfitting in Machine Learning.  I also alluded to it in my post about Polynomial Regression.

Basically, overfitting occurs when your model performs well on training data and poorly on data it hasn’t seen.

Here I’ll give an example using Artificial Neural Networks.  Those can be quite prone to overfitting since they have a variable number of parameters, i.e. a different number of hidden nodes.  Overfitting tends to occur once you put too many parameters in a model relative to the amount of training data.


I’ll reuse the height-weight data set I had you create in a past post.  If you need to recreate it, go back to that post.

Then let’s create a new experiment, let’s title it “Height-Weight - Overfitting” and let’s drop the data set on it.

We do not want the Index column and I would like to rename the columns.  Let’s use our new friend the Apply SQL Transformation module with the following SQL expression:

SELECT "Height(Inches)" AS Height, "Weight(Pounds)" AS Weight FROM t1

In one hit, I renamed two fields and removed another one.

We will then drop a Split module and connect it to the data set.  We will configure the Split module as follows:


The rest can stay as is.

Basically, our data set has 200 records and we’re going to take only the first 5 (0.025, or 2.5%) to train a neural network and the next 195 to test.

Here I turned off the “Randomized split” option so you can obtain results comparable to mine.  Usually you leave that on.
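To make the split concrete, here is a minimal Python sketch of a non-randomized split.  It is not how the Split module is implemented; the function name and the rounding rule are my own assumptions, and the list of integers merely stands in for the 200 records.

```python
def split_rows(rows, fraction):
    """Non-randomized split: the first `fraction` of the rows, then the rest.

    Assumption: the cut point is the row count times the fraction, rounded.
    """
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]


rows = list(range(200))  # stand-in for the 200 height-weight records
train, test = split_rows(rows, 0.025)
print(len(train), len(test))  # 5 195
```

With randomization off, the training set is always the same first 5 rows, which is why the results are repeatable from one reader to the next.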

As you can see, I’m really setting up a gross overfitting scenario where I starve my learning algorithm, showing it only a little data.

It might seem aggressive to take only 2.5% of the data for training, and it is.  Usually you would take 60% and above.  It’s just that I want to demonstrate overfitting with only 2 dimensions.  It usually occurs at higher dimensions, so in order to simulate it at low dimension, I am more aggressive with the training set size.

The experiment should look like this so far:



Let’s drop Neural Network Regression, Train Model, Score Model & Evaluate Model modules on the surface and connect them like this:


In the Train Model module we select the Weight column.  This is the column we want to predict.

In the Neural Network Regression module, we set the number of hidden nodes to 2.


To really drive the point home and increase overfitting, let’s crank the number of learning iterations to 10,000.  This will be useful when we increase the number of parameters.


And let’s leave the rest as is.

Let’s run the experiment.  This will train a neural network with 2 hidden nodes on 5 records and evaluate it against those same 5 records.

We can look at the results of the Evaluate Model module:


The metrics are defined here.  We are going to look at the Relative Squared Error.
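As a reminder of what that metric measures, here is a small Python sketch of the usual definition of Relative Squared Error: the model’s total squared error divided by the total squared error of a predictor that always answers the mean of the actual values.  The function name and the toy numbers are mine, made up for illustration; they are not rows from the data set.

```python
def relative_squared_error(actual, predicted):
    """Total squared error, relative to always predicting the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_actual) ** 2 for a in actual)
    return squared_error / baseline_error


actual = [1.0, 2.0, 3.0]        # made-up values
always_mean = [2.0, 2.0, 2.0]   # a "model" that always predicts the mean
perfect = [1.0, 2.0, 3.0]       # a model that predicts every value exactly

print(relative_squared_error(actual, always_mean))  # 1.0
print(relative_squared_error(actual, perfect))      # 0.0
```

So 0 is a perfect fit, 1 is no better than predicting the mean, and anything above 1 is worse than predicting the mean.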

The numbers you get are likely different since Neural Networks are optimized using approximation methods and randomization.  So each run might yield different results.


Now how does the model perform on data it didn’t see in its training?

Let’s drop another pair of Score Model & Evaluate Model modules and connect them like this:


Basically we will compute the score, or prediction, using the same trained model but on different data: the 195 remaining records not used during training.

We run the experiment again and we get the following results on the test evaluation:


The error on the test data is higher than on the training data.  It nearly always is.  Let’s see how it evolves when we increase the number of hidden nodes.

Comparing with different numbers of nodes

We are going to compare how those metrics evolve when we change the number of parameters of the Neural Network model, i.e. the number of hidden nodes.

I do not know of a way to “loop” in AzureML.  It would be very nice if I could wrap the experiment in a loop and have the number of hidden nodes of the model vary within the loop.  If you know how to do that, please leave a comment!

Failing that, we are going to manually change the number of hidden nodes in the Neural Network Regression module.

In order to make our life easier, let’s make the reporting of the results we are looking for more straightforward than having to open the results of the two Evaluate Model modules.  Let’s drop another Apply SQL Transformation and connect it this way:


and type the following SQL expression in:

SELECT t1."Relative Squared Error" AS TrainingRSE, t2."Relative Squared Error" AS TestingRSE FROM t1, t2

We are basically taking the outputs of both Evaluate Model modules and renaming them, which gives us (after another run) this nice result:
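If the “FROM t1, t2” part looks odd, it is a cross join: every row of t1 is paired with every row of t2.  Since each Evaluate Model output has a single row, we get a single combined row.  Here is a rough Python equivalent; the variable names are mine, and the two values are the ones from the 2-node run in the table further down.

```python
from itertools import product

# Each Evaluate Model output is a one-row table (other metric columns omitted).
training_eval = [{"Relative Squared Error": 0.154358}]
testing_eval = [{"Relative Squared Error": 3.270625}]

# A cross join pairs every row of t1 with every row of t2: 1 x 1 = 1 row.
combined = [
    {
        "TrainingRSE": t1["Relative Squared Error"],
        "TestingRSE": t2["Relative Squared Error"],
    }
    for t1, t2 in product(training_eval, testing_eval)
]
print(combined)
```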


Neat, eh?

OK, now let’s grind and manually change the number of hidden nodes a few times to fill the following table:

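| # Nodes | Training RSE | Testing RSE | Ratio (Training / Testing) |
|---------|--------------|-------------|----------------------------|
| 2       | 0.154358     | 3.270625    | 4.7%                       |
| 3       | 0.154394     | 3.366281    | 4.6%                       |
| 4       | 0.154442     | 3.673096    | 4.2%                       |
| 5       | 0.154488     | 3.455582    | 4.5%                       |
| 7       | 0.154612     | 3.835834    | 4.0%                       |
| 10      | 0.154847     | 4.242703    | 3.6%                       |
| 15      | 0.155334     | 4.301146    | 3.6%                       |
| 20      | 0.155946     | 4.281125    | 3.6%                       |
| 30      | 0.157558     | 4.222742    | 3.7%                       |
| 50      | 0.162018     | 3.586147    | 4.5%                       |
| 75      | 0.006491     | 3.920912    | 0.2%                       |
| 100     | 0.006791     | 3.082774    | 0.2%                       |
| 150     | 0.000025     | 2.544964    | 0.0%                       |
| 200     | 0.000015     | 2.249117    | 0.0%                       |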

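We can see that as we increased the number of parameters, the training error eventually collapsed (from about 0.15 with up to 50 nodes to nearly zero with 150 nodes and above) while the testing error stayed far higher.  The testing error rises until about 15 nodes and then, with some noise, starts going down again, but the ratio between the two trends down and ends near zero.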
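The Ratio column is simply the training RSE divided by the testing RSE.  Here is a short Python sketch that recomputes it from the numbers in the table and also picks out the node count with the worst testing error; the variable names are my own.

```python
# (hidden nodes, training RSE, testing RSE), copied from the table above
results = [
    (2, 0.154358, 3.270625), (3, 0.154394, 3.366281),
    (4, 0.154442, 3.673096), (5, 0.154488, 3.455582),
    (7, 0.154612, 3.835834), (10, 0.154847, 4.242703),
    (15, 0.155334, 4.301146), (20, 0.155946, 4.281125),
    (30, 0.157558, 4.222742), (50, 0.162018, 3.586147),
    (75, 0.006491, 3.920912), (100, 0.006791, 3.082774),
    (150, 0.000025, 2.544964), (200, 0.000015, 2.249117),
]

# Ratio = training RSE / testing RSE, formatted as a percentage
ratios = {nodes: f"{train / test:.1%}" for nodes, train, test in results}
for nodes, ratio in ratios.items():
    print(f"{nodes:>4} nodes: {ratio}")

# Node count with the highest testing RSE
worst_nodes = max(results, key=lambda row: row[2])[0]
print("Worst testing RSE at", worst_nodes, "nodes")
```

A ratio close to zero means the model fits its 5 training records far better than it fits anything else.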

A more visual way to look at it is to look at the actual predictions made by the models.


The green dots are the 200 points, the 5 yellow dots are the training set and the blue dots are the predictions of a Neural Network with 200 hidden nodes.

We can tell the prediction curve passes almost perfectly through the training points but describes the entire set poorly.


I wanted to show what overfitting could be like.

Please note that I exaggerated a lot of elements: the tiny training set, the huge number of training iterations, the huge number of hidden nodes.

The main point I want you to remember is to always test your model and not to automatically select the maximum number of parameters available!

Basically, overfitting is like learning by heart.  The learning algorithm learns the training set perfectly and then generalizes poorly.
