I meant to illustrate over fitting (discussed in a past blog) with AzureML. An easy way to illustrate it is to fit a bunch of sample points near perfectly and the best tool for that is Polynomial Regression.

I was surprised to see that AzureML doesn’t support Polynomial Regression natively. But… while thinking about it, you can implement it using a linear regression. In order to do that, I’ll introduce a useful module of AzureML: *Apply SQL Transformation*.

So I will keep over fitting for a future blog and concentrate on polynomial regression for today!

### Data Set

But first, let’s construct a data set. I want something that looks like a linear pattern with a bit of a wave driving it to simulate a bit of noise on top of a linear pattern.

Let’s build it in Excel. Very small data set: 20 points. Two columns: X & Y. Column X goes from 1 to 20. Column Y is a formula: *=SIN(3.5*(A2-1)/20*2*PI())+A2/2*. So basically, I add a sinusoid to a linear formula. This gives the following data set:

X |
Y |

1 | 0.5 |

2 | 2.526591 |

3 | 2.540285 |

4 | 2.092728 |

5 | 2.118427 |

6 | 2.365618 |

7 | 4.31284 |

8 | 5.546013 |

9 | 5.359401 |

10 | 5.086795 |

11 | 4.71177 |

12 | 5.876326 |

13 | 7.551295 |

14 | 8.585301 |

15 | 7.922171 |

16 | 7.470766 |

17 | 8.179313 |

18 | 9.332662 |

19 | 10.66634 |

20 | 11.06512 |

If you plot that in Excel:

As I’ve shown in a previous blog, we can take that data and import it in AzureML as a Data Set. I’ll skip the details for that here.

### From polynomial to linear

So AzureML doesn’t support polynomial regression. What can we do?

If you think about it 2 minutes, a polynomial regression is polynomial only in terms of the observed data, not the parameters. A polynomial regression, in one dimension, is

f(x) = a_0 + a_1*x + a_2*x^2 + a_3*x^3 + … + a_n*x^n

which looks like a linear regression with n dimensions in input and one in output. The input vector would be (x, x^2, …, x^n). The regression becomes:

f(x) = a_0 + (a_1, a_2, …, a_n) * (x, x^2, …, x^n)

where the multiplication here is a scalar multiplication between two vectors. Therefore we are back to a linear regression!

The trick is simply to augment the data set for it to contain the square, cube, etc. of the observed (independent) variable.

### Apply SQL Transformation

We could augment the data set directly in the data set but that’s a bit clunky as it pollutes the data set. Ideally we would do it “dynamically”, i.e. within an experiment. Enters *Apply SQL Transformation*.

Let’s start a new experiment and drop the data set we just created on it. Then, let’s drop a *Apply SQL Transformation* module (you can search for *SQL*) and link the two together:

*Apply SQL Transformation* has three entry points but only one is mandatory. The entry points are like SQL tables you would feed it. In our case, we only have one data set.

If you select the module you’ll see it takes an SQL expression in parameter:

*t1* in the SQL expression refers to the first entry point of the module. You could also use *t2* and *t3* if you would connect the other entry points.

It is said in the documentation that it understands SQLite. For our needs the limitation of SQLite vs TSQL won’t be any problem.

We will input this SQL:

*SELECT
X,
X*X AS X2,
Y
FROM t1*

Basically we do a typical SQL projection using the SQL syntax. This is quite powerful and can easily replace *Project Columns* and *Metadata Editor* modules in one go.

You can run the experiment and then look at the results by right clicking at the bottom of the SQL module.

You can see that the column we added is there with square value.

### Doing the linear regression

We can then continue and drop a linear regression, train model & score model on the experiment and connect them like this:

By selecting the train model module, we can click its “Launch column selector” button and select the *Y* variable. That is the variable the linear regression will predict.

We can now run the experiment and then look at the result of the *Train Model *module (not the score model one).

This is an interesting result. AzureML is associating a very weak weight to x^2. That means it isn’t really using it. Why?

Well, if you think about it, a polynomial degree 2 is a parabola and a parabola isn’t much better than a line to predict our sample set. Therefore AzureML reverts to a linear predictor despite we gave it the power to use more degrees of freedom!

It’s important you develop the reflex to read results like this. In my toy sample, we can visualize it all and deduce it at a glance, but with typical data sets, the dimension is high and you can’t visualize it.

Here AzureML is telling is: your data is linear!

### Over fitting

Remember I talked about over fitting? That is, the tendency of a learning algorithm to try to fit the sample data perfectly if it can. This typically happens when the learning algorithm has a lot of parameters and the sample data set has little information, i.e. it’s either small or contain records that do not add information (e.g. once you’ve given two points on a line, giving another thousand on the same line doesn’t add information about the data).

Here my data roughly has 8 tops and bottoms if you will. So if I would go and have a polynomial degree 9th, we should be able to match the curve more perfectly.

Let’s go back to the *Apply SQL Transformation* module and change the SQL expression for

*SELECT
X,
X*X AS X2,
X*X*X AS X3,
X*X*X*X AS X4,
X*X*X*X*X AS X5,
X*X*X*X*X*X AS X6,
X*X*X*X*X*X*X AS X7,
X*X*X*X*X*X*X*X AS X8,
X*X*X*X*X*X*X*X*X AS X9,
Y
FROM t1*

Let’s run the experiment again and look at the result of the *Train Model *module.

Now we see that AzureML used the parameters we made available. It is normal that the weight value go down since the data goes up (i.e. x^9 >> x for x>1).

Could we visualize the result?

### Visualizing

The method of visualizing the prediction of an algorithm I found is quite clunky so if you have a better one, please let me know in the commentary.

Basically, you drop a *Writer* module that you connect to the output of the *Score Model* module.

Then you can configure the writer to write in a blob container as CSV. You then take the CSV and paste the last column (the score column) in excel next to the input data. As I said… clunky.

Anyway, if you plot that you’ll get:

The blue dots are the sample set and the orange dots are the predicted points. As you see, the learning algorithm is getting closer to the training data.

Is that a good thing? It depends! If your data was really linear with noise in it, you are training AzureML to predict noise which rarely is possible nor useful. That is over fitting. If your data really has those bumps in it, then there you go.

### Summary

It is possible to implement a Polynomial Regression using *Linear Regression* and *Apply SQL Transformation* modules.

The latter module is quite powerful and can replace both *Project Columns* and *Metadata Editor* modules. You could even do some data clean up using that module (via a where clause).

Thanks for the great article! Simply adding one X*X column using the Apply SQL Transform method you outlined significantly improved the accuracy of my model to a coefficient of 0.97. However, when asking the trained model to predict new values, the accuracy is horrendous; previously if I passed a “0” value to the trained web service, it was within 10%, but now it’s way way far off. Any tips for how to use this approach for more accurate predictions?

Hum… that’s interesting. You might have “over training” but I doubt it would be that pronounced. Can you plot sample points (including training points but also new points) to check how your predictors is looking like?