# Tag Archives: Machine Learning

Machine Learning theory & application, Azure ML & Cognitive Services.

# Training a model to predict failures

Today a quick entry to talk about a twist on Machine Learning for the predictive maintenance problem.

The Microsoft Cortana Intelligence team wrote an interesting blog the other day:  Evaluating Failure Prediction Models for Predictive Maintenance.

When you listen to all the buzz around Machine Learning, it sometimes feels as if we’ve solved all the ML problems and you just need to point your engine in the general direction of your data for it to spew some insights.  Not yet.

That article highlights a challenge with predictive maintenance.  Predictive maintenance is about predicting when a failure (of a device, hardware, process, etc.) will occur.  But…  you typically have very few failures:  your data set is completely imbalanced between success samples and failure samples.

If you go head-on and train a model with your data set, a likely outcome is that your model will predict success 100% of the time.  This is because it doesn’t get penalized much for ignoring the few failures and gets rewarded for not missing any success.
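To make that loophole concrete, here is a minimal Python sketch.  The data set is entirely made up (990 successes and 10 failures) and the "model" is just a constant prediction, so treat it as an illustration rather than a real experiment:

```python
# Hypothetical, made-up data set: 990 healthy readings and 10 failures.
labels = ["success"] * 990 + ["failure"] * 10

# A "model" that ignores its input and always predicts success.
predictions = ["success"] * len(labels)

# Accuracy: share of all samples predicted correctly.
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)

# Recall on failures: share of the real failures that the model caught.
caught = sum(y == p == "failure" for y, p in zip(labels, predictions))
recall = caught / labels.count("failure")

print(f"accuracy: {accuracy:.1%}, failure recall: {recall:.1%}")
```

The accuracy comes out at 99% while the recall on failures is 0%:  the metric looks great and the model is useless for predictive maintenance.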

Remember that Machine Learning is about optimizing a cost function over a sample set (training set).  I like to draw the comparison with humans in a social system with metrics (e.g. bonus in a job, laws in society, etc.):  humans will find any loophole in a metric in order to maximize it and reap the reward.  So does Machine Learning.  You therefore have to be on the lookout for those loopholes.

With predictive maintenance, failures often cost a lot of money, sometimes more than a false positive (i.e. a non-failure identified as a failure).  For this reason, you don’t want to miss failures:  you want to compensate for the imbalance in your data set.

The article, which I suggest you read in full when you need to apply it, suggests three ways to compensate for the lack of failures in your data set:

1. Resample the failures to increase their occurrence, for instance by using the Synthetic Minority Oversampling Technique (SMOTE) readily available in Azure ML.
2. Tune the hyper-parameters of your model (e.g. by using a parameter sweep, also readily available in Azure ML) to optimize for recall.
3. Change the metric to penalize false negatives more than false positives.
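The third option can be sketched in a few lines of Python.  The function name `weighted_cost`, the 10-to-1 cost ratio and the two sets of predictions are my own assumptions for illustration; they come neither from the article nor from Azure ML:

```python
def weighted_cost(labels, predictions, fn_cost=10.0, fp_cost=1.0):
    """Total cost where a missed failure (false negative) weighs more
    than a false alarm (false positive). True means "failure"."""
    cost = 0.0
    for actual, predicted in zip(labels, predictions):
        if actual and not predicted:
            cost += fn_cost      # we missed a real failure
        elif predicted and not actual:
            cost += fp_cost      # we raised a false alarm
    return cost

# Hypothetical data: 95 successes followed by 5 failures.
labels = [False] * 95 + [True] * 5

# Model A never predicts a failure.
always_success = [False] * 100

# Model B raises 10 false alarms but catches 4 of the 5 failures.
cautious = [False] * 85 + [True] * 10 + [True] * 4 + [False]

print("always success:", weighted_cost(labels, always_success))
print("cautious:      ", weighted_cost(labels, cautious))
```

With plain accuracy the "always success" model wins (95% against 89%), but with this cost it loses (50 against 20).  Changing the metric changes which model gets rewarded.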

This is just one of the many twists you have to think of when creating a predictive model.

My basic advice:  make sure you not only ask the right question but also use the right carrot & stick.  If a failure, to you, costs more than a success identified as a failure (i.e. a false positive), then factor this into your model.

# How to do Data Science

These days, it’s all about Data Science.

What is Data Science?

Last month Brandon Rohrer, from the Cortana Intelligence and Machine Learning Blog, came up with an excellent post.

The post basically goes over the workflow I reproduced at the right here.

I found this article both complete and succinct:  a very good read.

It goes through all the motions you need to go through while analysing data, from fixing an objective (asking a question) and formatting the data to manipulating the data and finally throwing it at some Machine Learning.

It is something of a badly kept secret that 95% of Machine Learning is yak shaving around massaging data.  This is why I like the term Data Science:  it does include all the data manipulation you need to do to feed the beast.

So I strongly encourage you to read the post on https://blogs.technet.microsoft.com/machinelearning/2016/03/28/how-to-do-data-science/

# What is Statistics and why should you care?

Unless you graduated in art, chances are you did a course in Statistics.

Chances are you hated it.

Most people I know postponed that course until the end of their degree, didn’t understand much about it and hated it dearly.

I didn’t like it either and understood very little.

A few years later when I studied Machine Learning, I had to review Statistics on my own.  This is when I had the epiphany:  Wow!  This is actually not so complicated and can even be quite interesting!

There are two main reasons I hated my undergraduate course:

• Examples were all around surveys:  I was studying physics at the time, I didn’t care about those
• It was really geared towards a collection of recipes:  I love mathematics, elegant theories and understanding what I do, cheat sheets didn’t do it for me

I would like to share my epiphany from back then with you today.  Hopefully it will shed some light on the poorly understood topic of statistics.

This won’t be a deep dive in the science of statistics.  I want to explain what statistics is by capturing where it comes from and giving very simple examples.  I won’t make statisticians out of you today.  Sorry.

## Layer Cake

I see statistics as a layer cake.  At the foundation we have combinatorics, then probability and finally at the top, we have statistics.

There lies one of the mistakes, in my opinion, of most statistics courses:  they try to get statistics into your head without explaining the two colossi of science it is based on.

Try to explain calculus to somebody who has never seen f(x) = mx + b in 5 minutes and you won’t enlighten them either.

So let’s walk the layer cake from the bottom to the top.

## Combinatorics

Combinatorics:  branch of mathematics studying finite or countable discrete structures (Wikipedia).

Combinatorics is about counting stuff.  Counting elements in sets (cardinality) and then combining sets together.

You’ve done combinatorics, but you don’t remember, do you?  Let me jog your memory.

Ok, let’s say I have the set A = {1, 2, 3}.  How many pairs can I make with elements of A?  How many trios?  What if order is irrelevant and I just want to know the possible distinct pairs?

Yes, you’ve done that type of problem.  This is where you’ve learned a new meaning for the exclamation mark, i.e. the factorial:  n! = n × (n-1)! (with 0! = 1).
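If you want to check your answers to the questions about A, Python's standard library can do the counting for you.  This is a small sketch which assumes an element cannot be used twice within the same pair or trio:

```python
from itertools import combinations, permutations
from math import factorial

A = [1, 2, 3]

# Ordered pairs and trios: (1, 2) and (2, 1) count as two different pairs.
ordered_pairs = list(permutations(A, 2))
ordered_trios = list(permutations(A, 3))

# Distinct pairs when order is irrelevant: (1, 2) and (2, 1) are the same.
distinct_pairs = list(combinations(A, 2))

print(len(ordered_pairs), ordered_pairs)
print(len(ordered_trios), ordered_trios)
print(len(distinct_pairs), distinct_pairs)

# The same counts, written with factorials.
n = len(A)
print(factorial(n) // factorial(n - 2))                   # ordered pairs
print(factorial(n) // (factorial(2) * factorial(n - 2)))  # distinct pairs
```

There are 6 ordered pairs, 6 ordered trios and 3 distinct pairs.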

Let’s start an example I’ll carry through the rest of the article.  Let’s say I have a die with six faces.  The set of possible outcomes if I throw it is D = {1, 2, 3, 4, 5, 6}.

How many elements in D?  Six.  Not too hard?  Well, the point isn’t to be hard here.

You can get into quite complicated problems in combinatorics.  I remember an exam question where we had drawers filled with an infinite number of marbles of different colours and we had to mix them together…  that was quite fun.

A good example is the Rubik’s cube (from the Hungarian inventor Ernő Rubik).  A Rubik’s cube has 6 faces, each having 9 squares.  6 colours, with 9 squares of each colour, 54 squares in total.  What is the number of possible configurations?  Are some configurations impossible given the physical constraints of the cube?

## Probability

Probability:  measure of the likelihood that an event will occur.  Probability is quantified as a number between 0 and 1 where 0 indicates impossibility and 1 indicates certainty. (Wikipedia)

The canonical example is the toss of a coin.  Heads is 1/2; tails is also 1/2.

What is an event?  An event is an element of the set of all possible events.  An event occurrence is random.

A special category of events is of great interest:  equipossible events.  Those are events which all have the same chance of occurrence.  My coin tossing is like that.  So is my 6-faced die…  if it hasn’t been tampered with.

For those, we have a direct link with combinatorics:

$P(event) = \frac{1}{\#events}$

The probability of an event is one over the number of possible events.  Let’s come back to my die example:

$P(1) =\frac{1}{\#D}=\frac{1}{6}$

The probability to get a 1 is 1/6 since #D, the cardinality of D (the number of elements in D), is 6.  Same for all the other events, i.e. 2, 3, 4, 5, 6.

So you see the link with combinatorics?  You always compare things that you’ve counted.

If events aren’t equipossible, you transform your problem until they are.  This is where all the fun resides.
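As an example of such a transformation, take the sum of two dice (my own example, reusing the set D from above).  The sums 2 to 12 aren't equipossible, but the 36 ordered pairs of faces are, so we count pairs instead of sums.  A small Python sketch:

```python
from fractions import Fraction
from itertools import product

D = [1, 2, 3, 4, 5, 6]

# All ordered pairs (first die, second die): 36 equipossible outcomes.
outcomes = list(product(D, repeat=2))

def probability_of_sum(total):
    """Count the favourable pairs and divide by the number of pairs."""
    favourable = [pair for pair in outcomes if sum(pair) == total]
    return Fraction(len(favourable), len(outcomes))

for total in range(2, 13):
    print(total, probability_of_sum(total))
```

A sum of 7 can be obtained with 6 different pairs, hence a probability of 6/36 = 1/6, while a sum of 2 only comes from the pair (1, 1), hence 1/36.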

Now, as with combinatorics, you can go to town with probability when you start combining events, conditioning events and…  if you start getting into the objective versus subjective (Bayesian) interpretations.  But again, my goal isn’t to deep dive but just to illustrate what probability is.

## Statistics

Statistics:  study of the collection, analysis, interpretation, presentation, and organization of data (Wikipedia).

I find that definition quite broad as plotting a graph becomes statistics.

I prefer to think of statistics as a special class of probability.  In probability we define models by defining sets of events and the likelihood of events to occur; in statistics we take samples and check how likely they are according to the model.

A sample is simply a real world trial:  throwing a die, tossing a coin, picking an individual from a population, etc.

For instance, given my 6-faced die, let’s say I throw it once and I get a 2.  How confident are we that my die is equipossible with 1/6 probability for each face?

Well…  it’s hard to tell, isn’t it?  A lot of probability models could have given that outcome.  Anything where 2 has a non-zero probability really.

What if I throw twice and get a 2 each time?  Well…  that is possible.  What about three 2s?  What about 10?  You’ll start to get skeptical about my die being equipossible as we go, won’t you?

Statistics allows you to quantify that.

When you hear about a confidence interval around a survey, that’s exactly what it is about.

If I take my sequence of ‘2’s in my die throwing experiment, we can quantify it this way.  Throwing a ‘2’ has a probability of 1/6, which is the same probability as for any other result.  Now, whatever the first throw gives, having that same result n times in a row has a probability $P_{same} = (\frac{1}{6})^{n-1}$, while having a sequence where no throw repeats the previous one has a probability of $P_{different} = (\frac{5}{6})^{n-1}$.  So as n increases, it gets less and less likely to have a sequence of the same result.  You can set a threshold and decide whether you believe the underlying model or not; in this case, whether my die is a fair one or not.
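Here is a small Python sketch of that reasoning.  The threshold of 1 in 1000 is an arbitrary value I picked for illustration; it is not a standard:

```python
from fractions import Fraction

def p_same(n):
    """Probability that n throws of a fair die all show the same face."""
    return Fraction(1, 6) ** (n - 1)

def p_different(n):
    """Probability that no throw repeats the previous one, for a fair die."""
    return Fraction(5, 6) ** (n - 1)

# Arbitrary threshold: below it, we stop believing the die is fair.
THRESHOLD = Fraction(1, 1000)

for n in (1, 2, 3, 5, 10):
    verdict = "suspicious" if p_same(n) < THRESHOLD else "plausible"
    print(n, p_same(n), p_different(n), verdict)
```

With that threshold, four identical throws (1/216) are still plausible for a fair die, while five in a row (1/1296) fall under it.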

## Why should you care?

Statistics is at the core of many empirical sciences.  We use statistics to test theories, to test models.

Also, those three areas of mathematics (i.e. combinatorics, probability & statistics) branch out into other theories.

For example, information theory is based on probability.  That theory in turn helps us understand signal processing & data compression.

Machine Learning can be interpreted as statistics.  In Machine Learning, we define a class of statistical model and then look at samples (training set) to find the best model fitting that sample set.

Moreover, those areas of mathematics allow us to quantify observations.  This is key.  In Data Science, you take a volume of data and you try to make it talk.  Statistics helps you do that.

When you take a data set and observe some characteristics (e.g. correlation, dependency, etc.), one of the first things you’ll want to validate is the good old “is it statistically significant?”.  This is basically figuring out whether you could observe those characteristics by chance or whether they are a real characteristic of the data.  For instance, if I look at cars on the freeway and I observe a blue car then a red car a few times, is that chance or are there enough occurrences to think there is a real pattern?
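One simple way to ask "could this be chance?" is a permutation test:  shuffle one of the two variables many times and count how often the shuffled data looks at least as related as the real data.  The sketch below is a bare-bones version under my own assumptions (covariance as the measure of relationship, 2000 shuffles, a fixed seed so the run is repeatable); the helper names are hypothetical:

```python
import random

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Fraction of random shufflings of ys whose |covariance| with xs
    is at least as large as the one observed on the real pairing."""
    rng = random.Random(seed)
    observed = abs(covariance(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(covariance(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Made-up data where y depends perfectly on x.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]

p_value = permutation_p_value(xs, ys)
print("p-value:", p_value)
```

A small p-value means shuffled data almost never looks as related as the real data, so the relationship is unlikely to be chance.  Which threshold counts as "small" remains your call.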

So if you are an executive, you should care about statistics to go beyond just looking at (visualizing) the data of your business and to understand, at least at a high level, what type of models and assumptions your data scientists are making on your data.  Are you training models to learn trends in your business?  If so, what do the models look like and how do they perform in terms of prediction?

If you are a developer / architect, you should care about statistics for two big reasons.  First, you are probably instrumental in deciding what type of data you collect from the application and at which frequency (e.g. telemetry).  If you only log the number of users logged in once a day, your data scientists will have a hard time extracting information from that data.  The second reason is that you are likely going to use data for display and, more and more, to have your application make decisions.  Try to understand the data and the models used for decision making.

We live in a world of data abundance.  Data is spewing from every device & every server.  It is easy to see features in data that are simply noise or to miss features because they aren’t visible when you visualize data.  Statistics is the key to your data vault.

## Summary

I hope my article was more insightful than the statistics classes you remember.

Basically, combinatorics studies countable sets.  Probability uses combinatorics to assign probability (value between 0 & 1) to events.  Statistics takes samples and compares them to probability models.

Those fields of study have massive influence in many other fields.  They are key in Machine Learning and Data Science in general.

# Where is the statistics in Machine Learning?

I often try to explain what Machine Learning is to people outside the field.  I’m not always good at it but I am getting better.

One of the confusions I often run into when I start to elaborate on the details is the presence of statistics in Machine Learning.  For people outside the field, statistics are the stuff of surveys or of their mandatory science class (everyone I know outside of the Mathematics field who had a mandatory Statistics class in their first Uni year ended up postponing it until the end of their undergraduate studies!).

It feels a bit surprising that statistics would get involved while a computer is learning.

So let’s talk about it here.

Back in the day there were two big tracks of Artificial Intelligence.  One track was about codifying our knowledge and then running the code on a machine.  The other track was about letting the machine learn from the data; that is Machine Learning in a nutshell.

The first approach has its merits and had a long string of successes such as Expert Systems based on logic and symbolic paradigms.

The best example of those I always give is in vaccine medicine.  The first time I went abroad in tropical countries, in 2000, I went to a vaccine clinic and met with a doctor.  The guy asked me where I was going, consulted a big reference manual, probably used quite a bit of his knowledge and prescribed 3 vaccines his nurse administered to me.  In 2003, when I came back to Canada, I had a similar experience.  In 2009…  nope.  Then I met with a nurse.  She sat behind a computer and asked me similar questions, punching the answer in the software.  The software then prescribed me 7 vaccines to take (yes, it was a more dangerous place to go and I got sick anyway, but that’s another story).

That was an expert system.  Somebody sat with experts and asked them how they proceeded when making a vaccine prescription.  They took a lot of notes and codified this into a complex software program.

Expert systems are great at systematizing repetitive tasks requiring a lot of knowledge that can ultimately be described.

If I asked you to describe to me how you differentiate a male from a female face while looking at a photograph, it might get quite hard to codify.  Actually, if you answered me at all, you would probably give a bad recipe since you do that unconsciously and you aren’t aware of even 20% of what your brain does to make the decision.

For those types of problems, Machine Learning tends to do better.

A great example of this is how much better Machine Learning is at solving translation problems.  A turning point was the use of Canadian Hansards (transcripts of Parliamentary Debates in both French & English) to train machines how to translate between the two languages.

Some people, e.g. Chomsky, oppose Machine Learning as it actually hides the codification of knowledge.  Actually there is an interesting field developing to use machines to build explanations out of complicated mathematical proofs.

Nevertheless, why is there statistics in Machine Learning?

When I wrote posts about Machine Learning to explain the basics, I gave an example of linear regression on a 2D data set.

I glossed over the fact that the data, although vaguely linear, wasn’t a perfect line, far from it.  Nevertheless, the model we used was a line.

This is where a statistical interpretation can come into play.

Basically, we can interpret the data in many ways.  We could say that the data is linear but there were errors in measurements.  We could say that there are many more variables involved that aren’t part of the dataset but we’ll consider those as random since we do not know them.

Both explanations link to statistics and probability.  In the first one we suppose the errors are randomly distributed and explain the deviation from a perfect line.  In the second we assume there are missing data (variables) that, if present, would make the dataset deterministic but that, in their absence, we model as a random distribution.

Actually, there is a long-held debate in physics around the interpretation of Quantum Mechanics that opposes those two explanations.

In our case, the important point is that when we build a machine learning model out of a data set, we assume our model will predict the expectation (or average if you will) of an underlying distribution.
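To illustrate, here is a Python sketch using the first interpretation:  a truly linear relationship plus random measurement errors.  The relationship (y = 3x + 2), the noise level and the seed are all made up for the example.  An ordinary least squares fit recovers a line close to the underlying one, i.e. it predicts the average of y for a given x rather than any individual noisy point:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Made-up underlying relationship plus Gaussian "measurement errors".
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1.5) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"fitted line: y = {slope:.3f} x + {intercept:.3f}")
```

No single point sits on the fitted line, yet the slope and intercept land close to 3 and 2:  the model describes the expectation of the distribution, and the noise is what is left over.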

Machine Learning & Statistics are two sides of the same coin.  Unfortunately, they were born from quite different scientific disciplines and have different cultures and vocabularies that sustain confusion to this day.  A good article explaining those differences can be found here.

# Strong AI & Existential Risks

There has been a resurgence of hysterical talk about Strong Artificial Intelligence (AI) lately.

Strong AI is artificial intelligence matching and eventually going beyond the full human cognitive capacity.  Weak AI, by opposition, is the replication of some facets of human cognition:  face recognition, voice recognition, pattern matching, etc.

The ultimate goal of AI is to get to what researchers call Artificial General Intelligence, i.e. an AI that could duplicate human cognitive capacity in every aspect, including creativity and social behaviour.

Today, what we have is weak AI.  For some tasks, weak AI already performs better than humans though.  For instance, the Bing Image team has observed that for classifying images in many categories (thousands), AI performed better (made fewer errors) than humans.  For a small number of categories, humans are still better.  IBM Deep Blue was able to beat the world chess champion, IBM Watson was able to win at Jeopardy, etc.

But we are still far from Artificial General Intelligence.  Machines lack social aptitudes, creativity, etc.  Nevertheless they are evolving at a rapid pace.

You might have read about the triumvirate of Elon Musk, Stephen Hawking and Bill Gates warning the world of the danger of strong AI.

Those declarations made quite some noise as those three gentlemen are bright and sober people you do not see wearing tin foil hats on national TV.  They also are super stars of their field.

Although their arguments are valid, I found the entire line of thought too vague and unfocused to be compelling.

I compare this to when Hawking warned us about toning down the volume of our space probe messages in order not to attract malevolent aliens.  There was some merit to the argument I suppose, but it wasn’t well developed in my opinion.

For strong AI, I found a much more solid argument at TED.

Ah, TED…  What did I do before I could get my 20 minutes fixes with you?  Thank you Chris Anderson!

Anyway, there was Nick Bostrom, a Swedish philosopher who came to discuss What happens when our computers get smarter than we are?

Right off the bat, Bostrom makes you think.  When he shows a photo of Milton from Office Space and mentions “this is the normal way things are”, you can’t help but reflect on a few things.

Current office work / landscape is transitional.  There have been office workers since the late 19th century / early 20th century, as Kafka has depicted so vividly.  But the only constant in office work has been change.

In the time of Kafka, computers meant people carrying out computations!

Early in the banking career of my father, he spent the last Thursday night of every month at the office alongside a bunch of other workers computing the interest on the bank accounts!  They were using abacuses and paper since they didn’t have calculators yet.  You can imagine the constraints that put on the number of products the banks were able to sell back in those days.

I’ve been in IT for 20 years and about every 3 years my day to day job is redefined.  I’m still a solution architect, but the way I do my work and spend my time is different.  IT is of course extreme in that respect but the point is, office work has been changing since it started.

This is what you think about when you look at the photo Bostrom showed us.  This is what office work looked like in the 1990s and it is already an obsolete model.

Bostrom’s point was also that we can’t take today’s office reality as an anchor for how it will be in the future.  He then goes on to define existential risks (a risk that, if realised, would cause humanity’s extinction or a drastic drop in the quality of life of human beings) and how Strong AI could pose one.  He spends a little time debunking typical arguments about how we could easily contain a strong AI.  You can tell it is a simplification of his philosophical work for a big audience.

All of this is done in a very structured and compelling way.

Part of his argument resides in the observation that in the evolution of AI from Weak AI to Artificial General Intelligence to Strong AI, there is no train stop.  Once AI reaches our level it will be able to evolve on its own, without the help of human researchers.  It will continue to a Superintelligence, an AI more capable than humans in most ways.

His conclusion isn’t that we should burn computers and go back to computing on paper but that we should prepare for a strong AI and work hard on how we will instruct it.  One can’t help but think back to Isaac Asimov’s 3 laws of robotics which were created, in the fictional world, for the exact same purpose.

He casts Strong AI as an optimization machine, which is how we defined Machine Learning at the end of a previous post.  His argument is that since AI will optimize whatever problem we give it, we should think hard about how we define the problem so that we don’t end up being optimized out!

Nick Bostrom actually joined Elon Musk, Stephen Hawking and others in signing the open letter of the Future of Life Institute.  That institution is doing exactly what he proposed in the TED Talk:  research on how to mitigate the existential risk posed by Strong AI.

If you are interested in his line of thought, you can read his book Superintelligence: Paths, Dangers, Strategies, where those arguments are developed more fully.

As mentioned above, I found this talk gives a much better shape to the argument we hear these days about the existential risk posed by super intelligence.

Personally, I remain optimistic regarding an existential risk posed by Strong AI, probably because I remain pessimistic about its realisation in the near future.  I think our lives are going to be transformed by Artificial Intelligence.  Super Intelligence might not occur in my life time, but pervasive AI definitely will.  It will assist us in many tasks and will gradually take over tasks.  For instance, you can see with the likes of Skype Translator that translation will, in time, be done largely by machines.

Projections predict a huge shortage of software engineers.  For me this is an illusion.  What it means is that tasks currently requiring software engineering skills will become more and more common in the near future.  Since there aren’t enough software engineers around to do them, we will adapt the technology so that people with a moderate computer literacy background can perform them.  Look at the likes of Power BI:  it empowers people with moderate computer skills to do analysis previously requiring BI experts.  Cortana Analytics Suite promises to take it even further.

I think we’ll integrate computers and AI more and more in our life & our work.  It will become invisible and part of us.

I think there are 3 outcomes of Strong AI:

1. We are doomed, Terminator 2 type of outcome
2. We follow Nick Bostrom and make sure Strong AI won’t cause our downfall
3. We integrate with AI in such a way there won’t be a revolution but just a transition

Intuitively I would favor the third option as the most likely.  It’s a sort of transhumanism, where humanity will gradually integrate different technologies to enhance our capacities and condition.  Humanity would transition to a sort of new species.

All depends on how fast Strong AI comes, I suppose.

Ok, enough science-fiction speculation for today!  AI today is weak AI, getting better than us at very specific tasks, otherwise, we still rule!

Watch the TED talk, it’s worth it!

# Free ebook: Azure Machine Learning

You’re into Machine Learning, got into Azure ML, looked at my couple of blogs about it and want to take it to the next level?

Microsoft released an eBook for that exact purpose:

Free ebook: Azure Machine Learning (Microsoft Azure Essentials)

That book is targeted at people who want to get better with the Azure ML tool:  developers, data scientists, data analysts, etc.  It covers a bit of the concepts (Machine Learning in general) and quickly dives into practical examples with step-by-step instructions and screenshots.

The book is available in ePub (ideal for commuting!), PDF & Mobi.

Here are the chapters:

• Chapter 1, “Introduction to the science of data,” shows how Azure Machine Learning represents a critical step forward in democratizing data science by making available a fully-managed cloud service for building predictive analytics solutions.
• Chapter 2, “Getting started with Azure Machine Learning,” covers the basic concepts behind the science and methodology of predictive analytics.
• Chapter 3, “Using Azure ML Studio,” explores the basic fundamentals of Azure Machine Learning Studio and helps you get started on your path towards data science greatness.
• Chapter 4, “Creating Azure ML client and server applications,” expands on a working Azure Machine Learning predictive model and explores the types of client and server applications that you can create to consume Azure Machine Learning web services.
• Chapter 5, “Regression analytics,” takes a deeper look at some of the more advanced machine learning algorithms that are exposed in Azure ML Studio.
• Chapter 6, “Cluster analytics,” explores scenarios where the machine conducts its own analysis on the dataset, determines relationships, infers logical groupings, and generally attempts to make sense of chaos by literally determining the forests from the trees.
• Chapter 7, “The Azure ML Matchbox recommender,” explains one of the most powerful and pervasive implementations of predictive analytics in use today on the web today and how it is crucial to success in many consumer industries.
• Chapter 8, “Retraining Azure ML models,” explores the mechanisms for incorporating “continuous learning” into the workflow for our predictive models.

Enjoy!

# Azure ML – Over fitting with Neural Networks

In a past post, I discussed the concept of over fitting in Machine Learning.  I also alluded to it in my post about Polynomial Regression.

Basically, over fitting occurs when your model performs well on training data and poorly on data it hasn’t seen.

Here I’ll give an example using Artificial Neural Networks.  Those can be quite prone to over fitting since they have a variable number of parameters, i.e. a different number of hidden nodes.  Over fitting becomes likely once you put too many parameters in a model relative to the amount of training data.

### Data

I’ll reuse the height-weight data set I had you create in a past post.  If you need to recreate it, go back to that post.

Then let’s create a new experiment, let’s title it “Height-Weight – Overfitting” and let’s drop the data set on it.

We do not want the Index column and I would like to rename the columns.  Let’s use our new friend the Apply SQL Transformation module with the following SQL expression:

SELECT
"Height(Inches)" AS Height,
"Weight(Pounds)" AS Weight
FROM t1

In one hit, I renamed two fields and removed another one.

We will then drop a Split module and connect it to the data set.  We will configure the split module as follows:

The rest can stay as is.

Basically, our data set has 200 records and we’re going to take only the first 5 (0.025 or 2.5%) to train a neural network and the next 195 to test.

Here I removed the “Randomized split” so you can obtain results comparable to mine.  Usually you leave that on.
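Outside of Azure ML, a non-randomized split like this one boils down to slicing.  Here is a minimal Python sketch; the function name `split_first_rows` is hypothetical and the rows are just placeholder numbers standing in for the 200 records:

```python
def split_first_rows(rows, fraction):
    """Non-randomized split: the first `fraction` of the rows goes to
    training, everything after goes to testing."""
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]

# Placeholder for the 200 height-weight records.
rows = list(range(200))

train, test = split_first_rows(rows, 0.025)
print(len(train), "training records,", len(test), "test records")
```

Because nothing is shuffled, everybody running this gets the same 5 training records, which is exactly why I turned the randomization off.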

As you can see I’m really setting up a gross over fitting scenario where I starve my learning algorithm, showing it only a little data.

It might seem aggressive to take only 2.5% of the data for training and it is.  Usually you would take 60% and above.  It’s just that I want to demonstrate over fitting with only 2 dimensions.  It usually occurs in higher dimensions, so in order to simulate it in low dimensions, I go a bit more aggressively with the training set size.

The experiment should look like this so far

### Learning

Let’s drop Neural Network Regression, Train Model, Score Model & Evaluate Model modules on the surface and connect them like this:

In the Train Model module we select the weight column.  This is the column we want to predict.

In the Neural Network Regression module, we set the number of hidden nodes to 2.

To really drive the point home and increase over fitting, let’s crank the number of learning iterations up to 10,000.  This will be useful when we increase the number of parameters.

And let’s leave the rest as is.

Let’s run the experiment.  This will train a neural network with 2 hidden nodes on 5 records and evaluate it against those same 5 records.

We can look at the result of Evaluate Model module:

The metrics are defined here.  We are going to look at the Relative Squared Error.
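For reference, here is how I understand the Relative Squared Error, sketched in Python.  This assumes the usual definition (the total squared error of the predictions divided by the total squared error of always predicting the average of the actual values); check the metric definitions for the exact formula the module uses:

```python
def relative_squared_error(actual, predicted):
    """Squared error of the predictions, relative to the squared error
    of a naive model that always predicts the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    baseline = sum((a - mean_actual) ** 2 for a in actual)
    return residual / baseline

# Made-up values, just to show the two reference points of the metric.
actual = [1.0, 2.0, 3.0]
print(relative_squared_error(actual, [1.0, 2.0, 3.0]))  # perfect predictions
print(relative_squared_error(actual, [2.0, 2.0, 2.0]))  # always the mean
```

Under that definition, 0 means perfect predictions, 1 means no better than predicting the average and anything above 1 means worse than the average.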

The numbers you get are likely different since Neural Networks are optimized using approximation methods and randomization.  So each run might yield different results.

### Testing

Now how does the model perform on data it didn’t see in its training?

Let’s drop another pair of Score Model & Evaluate Model modules and connect them like this:

Basically we will compute the score, or prediction, using the same trained model but on different data, i.e. the 195 remaining records not used during training.

We run the experiment again and we get the following results on the test evaluation:

The error is higher on the test data than on the training data.  It nearly always is.  Let’s see how it evolves when we increase the number of hidden nodes.

### Comparing with different number of nodes

We are going to compare how those metrics evolve when we change the number of parameters of the Neural Network model, i.e. the number of hidden nodes.

I do not know of a way to “loop” in AzureML.  It would be very nice if I could wrap the experiment in a loop and have the number of hidden nodes of the model vary within the loop.  If you know how to do that, please leave a comment!

Failing that, we are going to manually change the number of hidden nodes in the Neural Network Regression module.

In order to make our life easier, let’s make the reporting of the results we are looking for more straightforward than having to open the results of the two Evaluate Model modules.  Let’s drop another Apply SQL Transformation and connect it this way:

and type the following SQL expression in:

SELECT
t1."Relative Squared Error" AS TrainingRSE,
t2."Relative Squared Error" AS TestingRSE
FROM t1, t2

We are basically taking both outputs of the Evaluate Model modules and renaming them, which gives us (after another run) the nice result:

Neat, eh?

Ok, now let’s grind and manually change the number of hidden nodes a few times to fill the following table:

| # Nodes | Training RSE | Testing RSE | Ratio |
|---|---|---|---|
| 2 | 0.154358 | 3.270625 | 4.7% |
| 3 | 0.154394 | 3.366281 | 4.6% |
| 4 | 0.154442 | 3.673096 | 4.2% |
| 5 | 0.154488 | 3.455582 | 4.5% |
| 7 | 0.154612 | 3.835834 | 4.0% |
| 10 | 0.154847 | 4.242703 | 3.6% |
| 15 | 0.155334 | 4.301146 | 3.6% |
| 20 | 0.155946 | 4.281125 | 3.6% |
| 30 | 0.157558 | 4.222742 | 3.7% |
| 50 | 0.162018 | 3.586147 | 4.5% |
| 75 | 0.006491 | 3.920912 | 0.2% |
| 100 | 0.006791 | 3.082774 | 0.2% |
| 150 | 0.000025 | 2.544964 | 0.0% |
| 200 | 0.000015 | 2.249117 | 0.0% |

The Ratio column is the training RSE divided by the testing RSE.  We can see that with enough parameters the training error collapses:  it sits around 0.15 up to 50 hidden nodes, then drops sharply from 75 nodes on and ends up at practically zero.  The testing error, meanwhile, stays high (between roughly 2.2 and 4.3):  it first goes up, then starts going down again, but never gets anywhere near the training error.  As a result, the ratio between the two ends up at practically zero.
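The ratio is simply the training RSE divided by the testing RSE.  If you want to recompute it, or extend the table with your own runs, here is a short Python sketch using the numbers I obtained (yours will differ):

```python
# (hidden nodes, training RSE, testing RSE) as measured in my runs.
results = [
    (2, 0.154358, 3.270625),
    (3, 0.154394, 3.366281),
    (4, 0.154442, 3.673096),
    (5, 0.154488, 3.455582),
    (7, 0.154612, 3.835834),
    (10, 0.154847, 4.242703),
    (15, 0.155334, 4.301146),
    (20, 0.155946, 4.281125),
    (30, 0.157558, 4.222742),
    (50, 0.162018, 3.586147),
    (75, 0.006491, 3.920912),
    (100, 0.006791, 3.082774),
    (150, 0.000025, 2.544964),
    (200, 0.000015, 2.249117),
]

# Ratio of training error to testing error, one entry per row.
ratios = {nodes: train / test for nodes, train, test in results}

for nodes, ratio in ratios.items():
    print(f"{nodes:>4} hidden nodes: ratio = {ratio:.1%}")
```

A ratio close to zero means the model is far better on the data it memorized than on the data it has never seen, which is the signature of over fitting.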

A more visual way to look at it is to look at the actual predictions done by the models.

The green dots are the 200 points, the 5 yellow dots are the training set and the blue dots are the predictions of a Neural Network with 200 hidden nodes.

We can tell the prediction curve goes perfectly through the training points but describes the entire set poorly.

### Summary

I wanted to show what over fitting could be like.

Please note that I did exaggerate a lot of elements:  the huge number of training iterations, the huge number of hidden nodes.

The main point I want you to remember is to always test your model, and not to automatically select the maximum number of parameters available!

Basically over fitting is like learning by heart.  The learning algorithm learns the training set perfectly and then generalizes poorly.
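If you want to see "learning by heart" without Azure ML, here is a small Python sketch.  Note that it swaps the neural network for something simpler:  a straight line (2 parameters) against a polynomial with as many parameters as training points.  The height-weight relationship, the noise values and the test points are all made up, and the test points sit on the noise-free relationship so the test error measures how far each model is from the truth:

```python
def fit_line(xs, ys):
    """Least squares straight line: only 2 parameters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_interpolating_polynomial(xs, ys):
    """Polynomial going exactly through every training point
    (Lagrange form): as many parameters as training records."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def true_weight(height):
    """Made-up underlying relationship between height and weight."""
    return 5.0 * height - 200.0

# 5 noisy training records (hypothetical noise values).
train_x = [62.0, 65.0, 68.0, 71.0, 74.0]
noise = [8.0, -9.0, 7.0, -8.0, 9.0]
train_y = [true_weight(x) + e for x, e in zip(train_x, noise)]

# Test points: heights from 60 to 76 on the noise-free relationship.
test_x = [60.0 + 0.5 * k for k in range(33)]
test_y = [true_weight(x) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

line_test = mean_squared_error(line, test_x, test_y)
poly_train = mean_squared_error(poly, train_x, train_y)
poly_test = mean_squared_error(poly, test_x, test_y)

print("line:       test error", line_test)
print("polynomial: training error", poly_train, "test error", poly_test)
```

The polynomial has a training error of zero (it learned the 5 records by heart, noise included) and yet does worse than the humble line on the test points.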