What is Statistics and why should you care?

Unless you graduated in art, chances are you did a course in Statistics.

Chances are you hated it.

Most people I know postponed that course until the end of their degree, didn’t understand much about it and hated it dearly.

I didn’t like it either and understood very little.

A few years later when I studied Machine Learning, I had to review Statistics on my own.  This is when I had the epiphany:  Wow!  This is actually not so complicated and can even be quite interesting!

There are two main reasons I hated my undergraduate course:

  • The examples all revolved around surveys: I was studying physics at the time and I didn’t care about those
  • It was really geared towards a collection of recipes: I love mathematics, elegant theories and understanding what I do, so cheat sheets didn’t do it for me

I would like to share my epiphany from back then with you today. Hopefully it will shed some light on the poorly understood topic of statistics.

This won’t be a deep dive into the science of statistics. I want to explain what statistics is by capturing where it comes from and giving very simple examples. I won’t make statisticians out of you today. Sorry.

Layer Cake

I see statistics as a layer cake.  At the foundation we have combinatorics, then probability and finally at the top, we have statistics.


There lies one of the mistakes, in my own opinion, of most statistics courses: they try to get statistics into your head without explaining the two colossi of science it is based on.

Try to explain calculus in 5 minutes to somebody who has never seen f(x) = mx + b and you won’t enlighten them either.

So let’s walk the layer cake from the bottom to the top.


Combinatorics: branch of mathematics studying finite or countable discrete structures (Wikipedia).

Combinatorics is about counting stuff.  Counting elements in sets (cardinality) and then combining sets together.

You’ve done combinatorics, but you don’t remember, do you?  Let me jog your memory.

Ok, let’s say I have the set A = {1, 2, 3}. How many pairs can I make with elements of A? How many trios? What if order is irrelevant and I just want to know the possible distinct pairs?

Yes, you’ve done that type of problem. This is where you learned a new meaning for the exclamation mark, i.e. the factorial: n! = n x (n-1)! (with 0! = 1).

Let’s start an example I’ll carry through the rest of the article. Let’s say I have a die with six faces. The set of possible outcomes if I throw it is D = {1, 2, 3, 4, 5, 6}.
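
To make those questions concrete, here is a minimal Python sketch that enumerates the answers for A = {1, 2, 3} and checks them against the factorial-based formulas. The variable names are mine, and "pair" is read three different ways since the question leaves it open.

```python
from itertools import combinations, permutations, product
from math import comb, perm

A = [1, 2, 3]

# Ordered pairs where an element may be used twice, e.g. (1, 1): 3 * 3 = 9
pairs_with_repetition = list(product(A, repeat=2))

# Ordered pairs of distinct elements, e.g. (1, 2) and (2, 1): 3! / (3 - 2)! = 6
ordered_pairs = list(permutations(A, 2))

# Pairs where order is irrelevant, e.g. {1, 2}: 3! / (2! * 1!) = 3
unordered_pairs = list(combinations(A, 2))

# Ordered trios of distinct elements: 3! = 6
ordered_trios = list(permutations(A, 3))

print(len(pairs_with_repetition), len(ordered_pairs),
      len(unordered_pairs), len(ordered_trios))

# The enumerations agree with the closed-form counts.
assert len(ordered_pairs) == perm(3, 2)
assert len(unordered_pairs) == comb(3, 2)
```

Enumerating works for tiny sets; the formulas are what let you count when the sets get too large to list.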

How many elements in D?  Six.  Not too hard?  Well, the point isn’t to be hard here.

You can get into quite complicated problems in combinatorics. I remember an exam question where we had drawers filled with an infinite number of marbles of different colours and we had to mix them together… that was quite fun.

A good example is the Rubik’s cube (from the Hungarian inventor Ernő Rubik). A Rubik’s cube has 6 faces, each having 9 squares: 6 colours, with 9 squares of each colour, 54 squares in total. What is the number of possible configurations? Are some configurations impossible given the physical constraints of the cube?
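
For the curious, the standard counting argument for the 3x3x3 cube can be sketched in a few lines of Python. It counts the movable pieces (8 corners and 12 edges) rather than the coloured squares, and the variable names are mine.

```python
from math import factorial

# If you could take the cube apart and reassemble it freely:
corner_arrangements = factorial(8) * 3**8    # 8 corners, 3 orientations each
edge_arrangements = factorial(12) * 2**12    # 12 edges, 2 orientations each
assemblies = corner_arrangements * edge_arrangements

# Physical constraints: legal turns fix the total corner twist (factor 3),
# the total edge flip (factor 2) and the permutation parity (factor 2),
# so only 1 assembly in 12 can be reached by turning the faces.
reachable = assemblies // (3 * 2 * 2)

print(f"{reachable:,} reachable configurations")
print(f"{assemblies - reachable:,} assemblies that no sequence of turns can produce")
```

So yes, some configurations are impossible: eleven out of every twelve ways of reassembling the pieces cannot be reached from a solved cube.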


Probability: measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 where 0 indicates impossibility and 1 indicates certainty. (Wikipedia)

The canonical example is the toss of a coin. Heads has probability 1/2; tails also has probability 1/2.

What is an event? An event is an element of the set of all possible events. An event occurrence is random.

A special category of events is of great interest: equipossible events. Those are events which all have the same chance of occurrence. My coin toss is like that. So is my 6-faced die… if it hasn’t been tampered with.

For those, we have a direct link with combinatorics:

P(event) = \frac{1}{\#events}

The probability of an event is one over the number of possible events.  Let’s come back to my die example:

P(1) =\frac{1}{\#D}=\frac{1}{6}

The probability to get a 1 is 1/6 since #D, the cardinality of D (the number of elements in D), is 6.  Same for all the other events, i.e. 2, 3, 4, 5, 6.

So you see the link with combinatorics?  You always compare things that you’ve counted.

If events aren’t equipossible, you transform your problem until they are.  This is where all the fun resides.
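
As an illustration of that transformation (my example, not one from the original course): the sum of two dice is not equipossible, since a 7 comes up more often than a 2, but the 36 ordered pairs behind the sums are. A small Python sketch, with names of my choosing:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

D = [1, 2, 3, 4, 5, 6]

# The 36 ordered pairs (first die, second die) are equipossible...
pairs = list(product(D, repeat=2))

# ...so the probability of a sum is (number of pairs giving that sum) / 36.
counts = Counter(a + b for a, b in pairs)
p_sum = {total: Fraction(n, len(pairs)) for total, n in counts.items()}

for total in sorted(p_sum):
    print(total, p_sum[total])
```

Once again the probability comes down to comparing two things that were counted.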

Now, as with combinatorics, you can go to town with probability when you start combining events, conditioning events and…  if you start getting into the objective versus subjective (Bayesian) interpretations.  But again, my goal isn’t to deep dive but just to illustrate what probability is.


Statistics: study of the collection, analysis, interpretation, presentation, and organization of data (Wikipedia).

I find that definition quite broad, as plotting a graph becomes statistics.

I prefer to think of statistics as a special class of probability. In probability we define models by defining sets of events and the likelihood of those events occurring; in statistics we take samples and check how likely they are according to the model.

A sample is simply a real world trial: throwing a die, tossing a coin, picking an individual from a population, etc.

For instance, given my six-faced die, let’s say I throw it once and I get a 2. How confident are we that my die is equipossible with 1/6 probability for each face?

Well… it’s hard to tell, isn’t it? A lot of probability models could have given that outcome. Anything where 2 has a non-zero probability really.

What if I throw twice and get a 2 each time? Well… that is possible. What about three 2s? What about 10? You’ll start to get skeptical about my die being equipossible as we go, won’t you?

Statistics allows you to quantify that.

When you hear about the confidence interval around a survey result, that’s exactly what that is about.

Take my sequence of ‘2’s in the die-throwing experiment; we can quantify it this way. Throwing a ‘2’ has a probability of 1/6 on a fair die, the same as any other result. Getting the same result n times in a row (whatever that result is) has probability P_{same} = (\frac{1}{6})^{n-1}, since the first throw is free and each of the next n-1 throws must match it. A sequence in which no throw repeats the previous one has probability P_{different} = (\frac{5}{6})^{n-1}. So as n increases, a run of identical results gets less and less likely. You can set a threshold and decide whether or not you believe the underlying model; in this case, whether or not my die is a fair one.
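
Here is a minimal Python sketch of that reasoning. The two functions are the formulas above; the 1% threshold is an assumption of mine, picked only to show the decision step.

```python
from fractions import Fraction

def p_same(n: int) -> Fraction:
    """Probability that n throws of a fair six-faced die all show the same face."""
    return Fraction(1, 6) ** (n - 1)

def p_different(n: int) -> Fraction:
    """Probability that no throw repeats the throw just before it."""
    return Fraction(5, 6) ** (n - 1)

THRESHOLD = Fraction(1, 100)  # assumed cut-off, not from the article

for n in (1, 2, 3, 4, 10):
    if p_same(n) < THRESHOLD:
        verdict = "too unlikely: I doubt the die is fair"
    else:
        verdict = "plausible for a fair die"
    print(n, float(p_same(n)), float(p_different(n)), verdict)
```

With this threshold, three identical throws in a row are still plausible for a fair die (1 chance in 36) while four are not (1 chance in 216). A different threshold moves that line, which is exactly the kind of choice a statistician has to make explicit.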

Why should you care?

Statistics is at the core of many empirical sciences.  We use statistics to test theories, to test models.

Also, those three areas of mathematics (i.e. combinatorics, probability & statistics) spawn off into other theories.

For example, information theory is based on probability.  That theory in turn helps us understand signal processing & data compression.

Machine Learning can be interpreted as statistics. In Machine Learning, we define a class of statistical models and then look at samples (the training set) to find the model that best fits that sample set.

Moreover, those areas of mathematics allow us to quantify observations. This is key. In Data Science, you take a volume of data and you try to make it talk. Statistics helps you do that.

When you take a data set and observe some characteristics (e.g. correlation, dependency, etc.), one of the first things you’ll want to validate is the good old “is it statistically significant?”. This is basically figuring out whether you could observe those characteristics by chance or whether they are a real characteristic of the data. For instance, if I look at cars on the freeway and I observe a blue car followed by a red car a few times, is that chance or are there enough occurrences to think there is a real pattern?

So if you are an executive, you should care about statistics to go beyond just looking at (visualizing) the data of your business and to understand, at least at a high level, what type of models and assumptions your data scientists are making on your data. Are you training models to learn trends in your business? If so, what do the models look like and how do they perform in terms of prediction?

If you are a developer / architect, you should care about statistics for two big reasons. First, you are probably instrumental in deciding what type of data you collect from the application and at which frequency (e.g. telemetry). If you only log the number of logged-in users once a day, your data scientists will have a hard time extracting information from that data. The second reason is that you are likely going to use data for display and, more and more, to have your application make decisions. Try to understand the data and the models used for decision making.

We live in a world of data abundance. Data is spewing from every device & every server. It is easy to see features in data that are simply noise, or to miss features because they aren’t visible when you visualize the data. Statistics is the key to your data vault.
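
One simple way to put a number on "could this be chance?" is a permutation test: shuffle one of the two variables many times and see how often the shuffled data looks at least as correlated as the real data. The Python sketch below is one possible approach, not the only one; the data is made up and the function names, the number of trials and the seed are my own choices.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Share of shufflings whose correlation is at least as strong as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        # Small tolerance so rounding errors do not hide an exact tie.
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # The +1 counts the observed ordering itself as one possible shuffling.
    return (hits + 1) / (trials + 1)

xs = list(range(10))
ys = [2 * x + 1 for x in xs]   # made-up, perfectly correlated data
print(permutation_p_value(xs, ys))
```

A small value means that chance alone rarely produces a correlation this strong, so the pattern is probably a real characteristic of the data.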


I hope my article was more insightful than the statistics classes you remember.

Basically, combinatorics studies countable sets. Probability uses combinatorics to assign a probability (a value between 0 & 1) to events. Statistics takes samples and compares them to probability models.

Those fields of study have massive influence in many other fields.  They are key in Machine Learning and Data Science in general.

Where is the statistics in Machine Learning?

I often try to explain what Machine Learning is to people outside the field.  I’m not always good at it but I am getting better.

One of the confusions I often meet when I start to elaborate on the details is about the presence of statistics in Machine Learning. For people outside the field, statistics are the stuff of surveys or of their mandatory science class (everyone I know outside of the Mathematics field who had a mandatory Statistics class in their first Uni year ended up postponing it until the end of their undergraduate studies!).

It feels a bit surprising that statistics would get involved while a computer is learning.

So let’s talk about it here.


Back in the day there were two big tracks in Artificial Intelligence. One track was about codifying our knowledge and then running the code on a machine. The other track was about letting the machine learn from the data; that is Machine Learning in a nutshell.

The first approach has its merits and had a long string of successes such as Expert Systems based on logic and symbolic paradigms.

The best example of those I always give is in vaccine medicine. The first time I went abroad to tropical countries, in 2000, I went to a vaccine clinic and met with a doctor. The guy asked me where I was going, consulted a big reference manual, probably used quite a bit of his knowledge and prescribed 3 vaccines his nurse administered to me. In 2003, when I came back to Canada, I had a similar experience. In 2009… nope. Then I met with a nurse. She sat behind a computer and asked me similar questions, punching the answers into the software. The software then prescribed me 7 vaccines to take (yes, it was a more dangerous place to go and I got sick anyway, but that’s another story).

That was an expert system. Somebody sat with experts and asked them how they proceeded when making a vaccine prescription. They took a lot of notes and codified this into a complex software program.
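
To give a feel for what "codifying" means, here is a toy Python sketch of a rule-based system. Everything in it is hypothetical: the destinations, the vaccine names and the rules are placeholders I made up for illustration, and a real system would be far larger.

```python
# Hypothetical rules: placeholders only, not medical advice.
# Each rule pairs a condition on the traveller's answers with vaccines to add.
RULES = [
    (lambda t: t["destination"] == "region_x", {"vaccine_a", "vaccine_b"}),
    (lambda t: t["destination"] == "region_y", {"vaccine_a", "vaccine_c"}),
    (lambda t: t["stay_in_days"] > 30, {"vaccine_d"}),
]

def prescribe(traveller: dict) -> set:
    """Apply every rule whose condition matches and collect the vaccines."""
    prescription = set()
    for condition, vaccines in RULES:
        if condition(traveller):
            prescription |= vaccines
    return prescription

answers = {"destination": "region_x", "stay_in_days": 45}
print(sorted(prescribe(answers)))
```

The knowledge lives in the rules, which a human wrote down; the machine only applies them. Nothing is learned from data.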


Expert systems are great at systematizing repetitive tasks requiring a lot of knowledge that can ultimately be described.

If I asked you to describe how you differentiate a male from a female face while looking at a photograph, it might get quite hard to codify. Actually, if you answered me at all, you would probably give a bad recipe since you do that unconsciously and you aren’t aware of 20% of what your brain does to make the decision.

For those types of problems, Machine Learning tends to do better.

A great example of this is how much better Machine Learning is at solving translation problems.  A turning point was the use of Canadian Hansards (transcripts of Parliamentary Debates in both French & English) to train machines how to translate between the two languages.

Some people, e.g. Chomsky, oppose Machine Learning as it actually hides the codification of knowledge. Actually there is an interesting field developing that uses machines to build explanations out of complicated mathematical proofs.


Nevertheless, why is there statistics in Machine Learning?

When I wrote posts about Machine Learning to explain the basics, I gave an example of linear regression on a 2D data set.

I glossed over the fact that the data, although vaguely linear, wasn’t a perfect line, far from it.  Nevertheless, the model we used was a line.

This is where a statistical interpretation can come into play.

Basically, we can interpret the data in many ways. We could say that the data is linear but there were errors in measurement. We could say that there are many more variables involved that aren’t part of the dataset, but we’ll consider those as random since we do not know them.

Both explanations link to statistics and probability. In the first one we suppose the errors are randomly distributed and explain the deviation from a straight line. In the second we assume missing data (variables) that, if present, would make the dataset deterministic but that, in their absence, we model as a random distribution.

Actually, there is a long-standing debate in physics around the interpretation of Quantum Mechanics that opposes those two explanations.
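
The first interpretation is easy to simulate. The Python sketch below generates points from a line I chose arbitrarily (slope 2, intercept 1), adds Gaussian "measurement errors", then fits a line by ordinary least squares. The hidden line, the noise level and the seed are all assumptions made for the illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = m * x + b; returns (m, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    b = mean_y - m * mean_x
    return m, b

rng = random.Random(42)
TRUE_M, TRUE_B = 2.0, 1.0   # the assumed "hidden" line

xs = [i / 20 for i in range(200)]                            # 0.00 to 9.95
ys = [TRUE_M * x + TRUE_B + rng.gauss(0, 1) for x in xs]     # line + random error

m, b = fit_line(xs, ys)
print(f"fitted slope {m:.3f}, fitted intercept {b:.3f}")
```

No single point sits on the line, yet the fitted slope and intercept land close to the hidden ones: the random errors largely cancel out once there are enough samples.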


In our case, the important point is that when we build a machine learning model out of a data set, we assume our model will predict the expectation (or average if you will) of an underlying distribution.
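
For the linear case this can be written down in two lines. This is a sketch under the assumption that the data follows a line plus a random error \varepsilon with zero mean (the symbols m, b and \varepsilon are my notation):

y = m x + b + \varepsilon, \qquad E[\varepsilon] = 0

E[y \mid x] = m x + b + E[\varepsilon] = m x + b

In other words, for a given x the line does not claim to give the value of y you will observe, only the average of the values you would observe if you could sample that x many times.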


Machine Learning & Statistics are two sides of the same coin. Unfortunately, they were born from quite different scientific disciplines and have different cultures and vocabularies, which sustains confusion to this day. A good article explaining those differences can be found here.