In a past blog entry I gave an overview of what Machine Learning is, using a simple linear regression example. The goal was to explain to newcomers to the field what Machine Learning is, what type of problem it tries to solve and what the general approach is.

Of course, I used an extremely simple example in order not to add noise to that main goal.

Machine Learning is about extracting information from a sample data set in order to select the optimal model (best set of parameters) fitting the sample set.

In this post, I’ll give you a more complete picture by looking at different aspects of more realistic Machine Learning scenarios.

### Different learning problems

We looked at one type of learning problem: regression, i.e. predicting continuous values for dependent variables given values of independent variables.

There are at least two other popular learning problems: classification & clustering.

A **classification** problem is a little like regression: you are trying to model *f* where *y = f(x)*, but instead of *y* being a continuous variable, *y* is discrete, i.e. it takes a finite number of values, e.g. {big, small}, {unknown, male, female}, etc.

An example of a classification problem from bioinformatics would be to take genomic scans (e.g. DNA microarray results) of a patient and predict whether they are prone to develop cancer (true) or not (false), based on a sample set of patients. The dependent variable here would take a Boolean value: {true, false}.

A **clustering** problem, on the other hand, consists of predicting a class **without** having classes in the sample set. Basically, the algorithm tries to segment the data automatically, as opposed to learning the classes from labelled examples as in classification problems.

An example of clustering, in marketing, would be to take customer data (e.g. demographics & buying habits) and ask the system to segment it into 3 groups.

There are, of course, many more learning problems, for instance time series analysis, where we do not look at static data but at how it varies over time.

### Different model classes

We looked at one class of models: linear regression.

Linear regression is a special type of regression. It is one of the simplest algorithms but also, often, the most useful.

Starting from the linear regression model, we could go non-linear, e.g. to polynomial, so instead of having *f(x) = m*x + b*, we would have

f(x) = a_0 + a_1*x + a_2*x^2 + a_3*x^3 + … + a_n*x^n

for a given **degree** *n*. The higher the degree, the more the function can curve to fit the sample set, as shown in the following examples:
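To make the formula concrete, here is a minimal Python sketch that evaluates such a polynomial from its list of coefficients. The function name `polynomial` and the sample coefficients are made up for illustration.

```python
def polynomial(coefficients, x):
    """Evaluate a_0 + a_1*x + ... + a_n*x^n using Horner's scheme.

    `coefficients` is [a_0, a_1, ..., a_n]; its length minus one is the degree.
    """
    result = 0.0
    for a in reversed(coefficients):
        result = result * x + a
    return result


# A line is simply a polynomial of degree 1: f(x) = b + m*x
line = [1.0, 2.0]                # b = 1, m = 2 (hypothetical values)
cubic = [1.0, 2.0, 0.0, -0.5]    # degree 3: more parameters, more room to curve

print(polynomial(line, 3.0))     # 1 + 2*3 = 7.0
print(polynomial(cubic, 2.0))    # 1 + 4 + 0 - 4 = 1.0
```

Each extra coefficient is one more parameter the learning algorithm is free to adjust.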

Another popular class of models is neural networks. They too have a variable number of parameters, as well as varying topologies (single layer, multi-layer, deep, with feedback loops, etc.).

Classification problems give rise to different models. A linear separator model actually borrows a lot of concepts from linear regression: imagine that a line is used not to predict points but to separate them into two categories. The same can be said of a lot of regression models.
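As a rough illustration of the linear separator idea, the sketch below classifies a 2D point by checking on which side of a line it falls. The weights and bias are chosen by hand here (an assumption for the example); in a real scenario they would be the parameters learned from the sample set.

```python
def classify(point, weights, bias):
    """Return True if the point lies on the positive side of the line
    weights[0]*x1 + weights[1]*x2 + bias = 0, and False otherwise."""
    score = sum(w * v for w, v in zip(weights, point)) + bias
    return score > 0


# Hypothetical separator: the line x2 = x1
weights, bias = [-1.0, 1.0], 0.0

print(classify((1.0, 3.0), weights, bias))  # above the line -> True
print(classify((3.0, 1.0), weights, bias))  # below the line -> False
```

The expression inside `classify` is the same linear form used in linear regression; only the interpretation of its output changes.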

The most popular model (and the simplest) for clustering is k-means.
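For the curious, here is a minimal sketch of the usual k-means procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The customer data and the starting centroids are invented for the example, and a fixed number of iterations stands in for a proper convergence test.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between an assignment step and a centroid update step."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    # `clusters` holds the assignment computed in the last iteration
    return centroids, clusters


# Hypothetical customers described by two features already scaled to 0-10
customers = [(1, 1), (1, 2), (2, 1),
             (8, 8), (9, 8), (8, 9),
             (1, 8), (2, 9), (1, 9)]
initial = [(0, 0), (10, 10), (0, 10)]   # hand-picked starting centroids

centroids, clusters = k_means(customers, initial)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Note that no class labels appear anywhere in the data: the three segments emerge from the distances alone.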

### Dimensions

Most real-life problems are not about mapping one independent variable onto one dependent one, as we did in the previous post. There are multiple dimensions, both in the independent and in the dependent (predicted) variables.

Multiple dimensions bring two complexities to the table: visualization & curse of dimensionality.

Having more than 2 or 3 dimensions makes visualization non-trivial. In fact, some data analysis or simple machine learning methods (e.g. linear regression) can be useful for a first analysis of the data before choosing what type of ML model to apply to it.

Most models are geared to work in multiple dimensions. The general linear model, for instance, works in multiple dimensions.

When the number of dimensions is high, a well-known set of phenomena, referred to as the curse of dimensionality, occurs. Basically, our intuition in low dimensions (e.g. 2, 3) fails us in high dimensions.

To give only one example of such phenomena, take a hypersphere of dimension *d* and look at how the points within it are distributed as *d* increases. With *d*=2, a circle, if you pick a point uniformly at random within the circle, the distance between the point and the center is spread over the whole range from zero to the radius, with a density that grows only linearly towards the edge. As *d* increases, the volume of the hypersphere concentrates closer to the surface (the density of that distance grows like *r^(d-1)*), making the distribution extremely high for distances close to the radius of the hypersphere and close to zero towards the middle. Can you picture that in your head? I can’t, and that’s why high dimensions are tricky to work with.

A lot of techniques fail to work in high dimensions because of that.
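This concentration near the surface can be checked with a little arithmetic. The volume of a ball of radius *r* in *d* dimensions scales as r^d, so the fraction of the volume lying within a thin outer shell of relative thickness *t* is 1 - (1 - t)^d. The sketch below (the function name is mine) evaluates that fraction for a shell covering the outer 10% of the radius.

```python
def outer_shell_fraction(d, thickness=0.1):
    """Fraction of a d-dimensional ball's volume that lies within `thickness`
    (expressed as a fraction of the radius) of its surface."""
    return 1.0 - (1.0 - thickness) ** d


for d in (2, 3, 10, 100):
    print(d, round(outer_shell_fraction(d), 4))
```

For a circle the outer 10% of the radius holds 19% of the area; at *d*=10 it already holds about 65% of the volume, and at *d*=100 practically all of it.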

### Testing set

An important activity one wants to do with a model is to validate how good it is.

Since we select the optimal model for a given sample set, how good can it actually be? Put otherwise, how can we measure the quality of the model?

A typical way to do that is to split the available sample set in two sets:

- A **learning** data set
- A **testing** data set

The learning data set is used to learn / optimize the model’s parameters while the testing data set is used to validate the output model.

The validation is often done by computing the cost function over the testing set. This is a good way to compare two model classes, e.g. linear regression vs. neural network, to see how they each perform. Using this kind of validation, one can actually train different classes of models and select the best one overall. Testing also allows someone to avoid over fitting, which is our next topic.
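Here is a minimal sketch of that workflow: split the samples, then compare candidate models by their cost on the testing set. The data, the 30% test ratio and the two candidate models are all hypothetical; the candidates stand in for models that would have been trained on the learning set.

```python
import random


def split(samples, test_ratio=0.3, seed=0):
    """Shuffle a copy of the samples and return (learning, testing)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    test_count = round(len(shuffled) * test_ratio)
    return shuffled[test_count:], shuffled[:test_count]


def mean_squared_error(model, samples):
    """Cost function: average squared gap between prediction and observation."""
    return sum((model(x) - y) ** 2 for x, y in samples) / len(samples)


def candidate_a(x):
    return 2 * x + 1


def candidate_b(x):
    return 3 * x


# Hypothetical sample set following y = 2x + 1
samples = [(x, 2 * x + 1) for x in range(10)]
learning, testing = split(samples)

# Select the model with the lowest cost on data it has not been fitted on
best = min((candidate_a, candidate_b),
           key=lambda model: mean_squared_error(model, testing))
print(best.__name__, mean_squared_error(best, testing))
```

The important design point is that `testing` is never used while fitting, only for scoring.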

### Over fitting

You might have noticed that I’ve talked about different models having a varying number of parameters, with more parameters allowing a better fit of the training data. Take polynomial regression, for instance: a linear regression has two parameters (origin and slope), a cubic regression has four, etc.

You might then wonder, why not put the maximum number of parameters in there and let it do its thing? Yes, there is a catch. The catch is that if you do that, you will over fit your training data set and your model will be excellent at predicting the data it has seen and terrible at doing any generalization.

As usual, it is easier to explain by showing an example. Let’s take my example from the previous article. To make the example obvious, I’ll exaggerate. I’ll take only two points from the training set (circled in green) and do a linear regression on those.

You see how the top line fits those two points perfectly? The distance between the points and the line is zero. The cost function is zero. Perfect optimization. But you also see how poorly it predicts the other points compared to the bottom line? The top line over fits the data.

If I had split the data between a training and a test set, as discussed in the previous section, I would have been able to measure the poor generalization quality.

This is actually one way to fight over fitting: test the learning algorithm and select one with better generalization capacity. For instance, I could say I’ll go with polynomial regression but select the polynomial degree using the testing set.
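The effect can be reproduced numerically. In the sketch below (with made-up data scattered around y = x), one line is fitted on only two points and another on a larger training set, and both are scored on the same held-out points. `fit_line` uses the standard closed-form least-squares solution for a line.

```python
def fit_line(samples):
    """Least-squares line through the samples; returns (slope, intercept)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
             / sum((x - mean_x) ** 2 for x, _ in samples))
    return slope, mean_y - slope * mean_x


def cost(line, samples):
    """Mean squared vertical distance between the line and the samples."""
    slope, intercept = line
    return sum((slope * x + intercept - y) ** 2 for x, y in samples) / len(samples)


# Hypothetical noisy observations
tiny_training = [(1, 0.4), (2, 2.6)]
larger_training = [(0, 0.5), (1, 0.4), (2, 2.6), (4, 4.6), (6, 6.5)]
held_out = [(3, 2.5), (5, 4.4), (7, 6.6)]

overfit = fit_line(tiny_training)
sensible = fit_line(larger_training)

print("training cost of the two-point line:", cost(overfit, tiny_training))
print("held-out cost of the two-point line:", cost(overfit, held_out))
print("held-out cost of the larger fit:   ", cost(sensible, held_out))
```

The two-point line has an essentially zero training cost and yet a far worse held-out cost than the line fitted on more data.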
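Selecting the degree with the testing set could look like the sketch below. To keep it self-contained, the polynomial is fitted by solving the normal equations with a small Gaussian elimination, which is my choice for the example rather than anything prescribed here; in practice you would lean on a numerical library. The data (a noisy parabola) is invented.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    rows = [row[:] + [v] for row, v in zip(matrix, vector)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - known) / rows[r][r]
    return solution


def fit_polynomial(samples, degree):
    """Least-squares coefficients [a_0, ..., a_degree] via the normal equations."""
    size = degree + 1
    gram = [[sum(x ** (i + j) for x, _ in samples) for j in range(size)]
            for i in range(size)]
    rhs = [sum(y * x ** i for x, y in samples) for i in range(size)]
    return solve(gram, rhs)


def cost(coefficients, samples):
    """Mean squared error of the polynomial over the samples."""
    def predict(x):
        return sum(a * x ** k for k, a in enumerate(coefficients))
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)


# Hypothetical data: y = x^2, with a small alternating disturbance on the learning set
learning = [(i / 4, (i / 4) ** 2 + (0.05 if i % 2 == 0 else -0.05))
            for i in range(-4, 5)]
testing = [((2 * i + 1) / 8, ((2 * i + 1) / 8) ** 2) for i in range(-4, 4)]

# Fit on the learning set, score on the testing set, keep the best degree
test_costs = {d: cost(fit_polynomial(learning, d), testing) for d in range(1, 7)}
best_degree = min(test_costs, key=test_costs.get)
print(test_costs)
print("selected degree:", best_degree)
```

The learning cost can only go down as the degree grows, so it is the testing cost that tells you when extra parameters stop paying off.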

Here is another example: a polynomial regression of high degree (say n>30) fits a few points perfectly but has poor generalization capacity.

There are two sides to over fitting: a large number of parameters and a low number of points in the sample set. One can therefore either lower the number of parameters or increase the data set size, which is the next topic.

It is to be noted that a large number of parameters used with a data set containing an enormous number of points won’t over fit: the superfluous parameters will adjust and end up close to zero. A line is a special kind of cubic, for instance.

### Data set size

As hinted in the previous topic, increasing the number of points in a training data set increases the ability of the trained model to generalize.

Data set sizes have exploded in the last couple of years (yes, the Internet basically, but also cheap storage) and that is one of the main reasons for the Machine Learning renaissance.

This is why combining big data with Machine Learning is key to unleashing its potential.

### Iteration / Approximation

With linear regression, we can compute an exact solution. Most models do not allow that. Typically we iteratively approximate a solution. Furthermore, for most non-trivial learning problems, there are multiple local optima, which makes finding the global optimum all the more difficult.

This would require an entire blog post just to scratch the surface, but suffice it to say that you’ll see a lot of notions of **steps** or **iterations** in Machine Learning models. They have to do with the approximation methods used.

For instance, one of the ways to avoid over fitting is to stop the approximation process before it reaches a local optimum. It can be shown that this improves the generalization capacity of the model.
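To give a feel for what those iterations look like, here is a sketch of gradient descent fitting a line, with a simple early-stopping rule. The rule (stop once the cost on held-out data has not improved for `patience` consecutive steps) is one common choice that I am assuming for the example, as are the learning rate and the data.

```python
def cost(params, samples):
    """Mean squared error of the line y = m*x + b over the samples."""
    m, b = params
    return sum((m * x + b - y) ** 2 for x, y in samples) / len(samples)


def gradient_step(params, samples, rate):
    """One iteration: move (m, b) a small step against the gradient of the cost."""
    m, b = params
    n = len(samples)
    grad_m = sum(2 * (m * x + b - y) * x for x, y in samples) / n
    grad_b = sum(2 * (m * x + b - y) for x, y in samples) / n
    return m - rate * grad_m, b - rate * grad_b


def train(learning, validation, rate=0.05, max_steps=5000, patience=20):
    """Gradient descent that stops early once the validation cost stalls."""
    params = best = (0.0, 0.0)
    best_cost, stalled = cost(best, validation), 0
    for step in range(1, max_steps + 1):
        params = gradient_step(params, learning, rate)
        current = cost(params, validation)
        if current < best_cost:
            best, best_cost, stalled = params, current, 0
        else:
            stalled += 1
            if stalled >= patience:
                break  # no improvement for `patience` steps: stop iterating
    return best, step


# Hypothetical data scattered around y = 2x + 1
learning = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]
validation = [(0.5, 2.0), (1.5, 4.1), (2.5, 5.9)]

best, steps = train(learning, validation)
print("parameters:", best, "after", steps, "steps")
print("validation cost:", cost(best, validation))
```

The parameters returned are the ones that did best on the held-out data, not necessarily the ones that minimize the cost on the learning set.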

### Data Quality

I kept this one for the end, but believe me, it isn’t the least. It plagues every Machine Learning problem.

If there is one field of science where the saying *Rubbish In, Rubbish Out* applies, it is in Machine Learning.

Machine Learning models are extremely sensitive to the quality of data.

Real data sets you take from databases are rubbish. They are full of missing data, data has been captured in different (undocumented) conditions and sometimes the semantics of the data change over time.

This is why a lot of the leg work in Machine Learning has little to do with Machine Learning and a lot to do with Data Cleansing.

Let me give you a real example I stumbled upon lately. We were discussing an Internet of Things (IoT) problem with a customer. We were planning to install vibration / temperature sensors on industrial machines to monitor their manufacturing process. We were thinking about how we could eventually leverage the collected data with Machine Learning.

The key point I made in that conversation was that it would need to be designed up front into the IoT implementation. Otherwise, I can guarantee that different sensors wouldn’t be calibrated the same way and that Machine Learning models would just get lost in the fuzzy data and make poor predictions as a consequence.

The basics of Data Cleansing are to normalize the data (center it around its average and make its standard deviation equal to one), but it goes beyond that.

This is why I see Machine Learning as yet another step in the B.I. pipeline of a company: first have your transaction data, then aggregate your data, control the quality of your data, then you can analyse and make predictions.
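The normalization step itself is short. Below is a minimal sketch using Python's `statistics` module; the sensor readings are invented, and the population standard deviation is used (the sample standard deviation would be an equally defensible choice).

```python
import statistics


def normalize(values):
    """Center the values on their mean and scale them to unit standard deviation.

    Assumes the values are not all identical (the deviation would be zero).
    """
    mean = statistics.fmean(values)
    deviation = statistics.pstdev(values)
    return [(v - mean) / deviation for v in values]


# Hypothetical temperature readings from one sensor
readings = [20.0, 22.0, 19.0, 24.0, 25.0]
scaled = normalize(readings)

print(scaled)
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

Applying this per sensor puts differently calibrated sources on a comparable scale, although it cannot repair missing or mislabelled data.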

### Summary

In the last two blog posts, I’ve tried to give you a taste of what Machine Learning is. I hope it wasn’t too much of a torrent of information to take in.

It is important to understand those basics before jumping into powerful tools such as Azure Machine Learning.

ML is a relatively mature field, 30 to 50 years old (depending on where you draw the line between Machine Learning, Data Analysis and Statistics), so there is a lot out there. It is also a young field where progress is made every day.

What makes it exciting today is the capacity we have to gather data and process it so quickly!

If you would like some complementary reading, take a look at Azure Machine Learning for Engineers. They take machine learning from a different angle and tie it immediately to concrete usage (with Azure ML).

If you have any questions, do not hesitate to use the comment section below!

**UPDATE**: See all those concepts put into practice in Azure ML – Simple Linear Regression.
