
Hyperspheres & the curse of dimensionality

I previously talked about the curse of dimensionality (more than 2 years ago) related to Machine Learning.

Here I wanted to discuss it in more depth and dive into the mathematics of it.

High dimensions might sound like Physics’ string theory where our universe is made of more than 4 dimensions.  This isn’t what we are talking about here.

The curse of dimensionality is related to what happens when a model deals with a data space with dimensions in the hundreds or thousands.

As the title of this article suggests, we’re going to take the angle of the properties of Hyperspheres (spheres in N dimensions) to explore high dimension properties.

This article is inspired by Foundations of Data Science by John Hopcroft & Ravindran Kannan (chapter 2).

Why should I care about High Dimension?

When introducing Machine Learning concepts, we typically use few dimensions in order to help visualization.  For instance, when I introduced linear regression or polynomial regression in past articles, I used datasets in two dimensions and plot them on a chart.

In the real world, typical data sets have many more dimensions.

A typical case of high dimension is image recognition (or character recognition as a sub-category) where even a low-resolution picture will have hundreds of pixels.  The corresponding model would take a gray-scale input vector of dimension 100+.

With fraud detection, transactions do not contain only the value of the transaction, but the time of day, day of week, geo-location, type of commerce, type of products, etc.  This might or might not be a high dimension problem, depending on the available data.

In an e-commerce web site, a Product recommendation algorithm could be as simple as an N x N matrix of 0 to 1 values where N is the number of products.

With IoT, multiple sensors feed a prediction model.

In bioinformatics, DNA sequencing generates a huge amount of data which is often arranged in a high-dimensional model.

Basically, high dimensions crop up everywhere.

What happens as dimension increases?

For starters, a space with more dimensions simply is…  bigger.  In order to sample a space with 2 dimensions at a resolution of 10 units, we need 10^2 = 100 points.  Having the same sampling in a space of dimension 3 requires 10^3 = 1000 points.  Dimension 20?  That would require 10^20 = 100 000 000 000 000 000 000 points.

Right off the bat we can tell that sampling the space of dimension 2 & 3 is realistic while for a space of dimension 20, it’s unlikely.  Hence we are likely going to suffer from under-sampling.
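
As a quick sanity check, here is a minimal sketch (plain Python, no dependencies) of that growth, using the same assumption of 10 units of resolution per axis:

```python
# Number of grid points needed to sample a space at a fixed resolution:
# resolution ** dimension.
RESOLUTION = 10  # 10 units per axis, as in the example above

for dimension in (2, 3, 20):
    points = RESOLUTION ** dimension
    print(f"dimension {dimension:>2}: {points:,} points")

# dimension  2: 100 points
# dimension  3: 1,000 points
# dimension 20: 100,000,000,000,000,000,000 points
```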

Yoshua Bengio has a nice discussion about Curse of Dimensionality here.

Hypersphere in a cube

Beyond sampling problems, metrics & measures change behaviour in high dimensions.  Intuitively this makes sense since a measure takes a vector (or vectors) and squeezes it (or them) into a single numerical value ; the higher the dimension, the more data we squeeze into one number & hence the more information we should lose.

We use metrics & measures heavily in Machine Learning.  For instance, a lot of cost (or loss) functions are based on the (squared) Euclidean distance:

dist(x,y) = \displaystyle\sum_{i=1}^N (x_i-y_i)^2

Now if x and / or y are random variables (e.g. samples), the law of large numbers applies when N becomes large.  This implies the sum concentrates around its expected value, with a relatively narrower standard deviation, as N increases.  In turn, this means there is less and less information in the distance as the number of dimensions increases.
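
This concentration can be checked with a small simulation.  Here is a minimal sketch (using numpy, with arbitrary sample sizes): it draws random pairs of points in the unit cube and measures how the relative spread of the squared distances shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs = 2_000  # arbitrary number of random pairs per dimension

for dimension in (2, 10, 100, 1000):
    x = rng.random((n_pairs, dimension))
    y = rng.random((n_pairs, dimension))
    squared_distances = np.sum((x - y) ** 2, axis=1)
    # Relative spread: standard deviation divided by the mean.
    relative_spread = squared_distances.std() / squared_distances.mean()
    print(f"dimension {dimension:>4}: relative spread of dist(x,y) = {relative_spread:.3f}")
```

The relative spread falls roughly as 1 / \sqrt{N}, which is exactly the loss of information described above.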

This brings us to the hypersphere.  An hypersphere’s equation is

\displaystyle\sum_{i=1}^N x_i^2 = R^2

where x is a point of dimension N and R is the radius of the hypersphere.

An hypersphere of dimension 1 is a line segment, an hypersphere of dimension 2 is a circle, dimension 3 is a sphere, dimension 4 is an…  expanding universe?  and so on.

A theorem I’ll demonstrate in a future article is that the volume of an hypersphere of radius 1 tends to zero as the dimension increases.

UPDATE (12-07-2017):  Demonstration of hypersphere hyper volume is done in this article.

This is fairly unintuitive, so let me give real numerical values:

Dimension   Hyper volume (radius 1)
    1       2
    2       3.141592654
    3       4.188790205
    4       4.934802201
    5       5.263789014
    6       5.16771278
    7       4.72476597
    8       4.058712126
    9       3.298508903
   10       2.55016404
   11       1.884103879
   12       1.335262769
   13       0.910628755
   14       0.599264529
   15       0.381443281
   16       0.23533063
   17       0.140981107
   18       0.082145887
   19       0.046621601
   20       0.025806891
   21       0.01394915
   22       0.007370431
   23       0.003810656
   24       0.001929574
   25       0.000957722
   26       0.000466303
   27       0.000222872
   28       0.000104638

If we plot those values:

[Plot: hyper volume of the unit hypersphere as a function of the dimension]

We see the hyper volume increases in the first couple of dimensions.  A circle of radius 1 has an area of pi (3.1416) while a sphere of radius 1 has a volume of 4.19.  It peaks at dimension 5 and then shrinks.
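
If you want to reproduce the table above, here is a minimal Python sketch using the standard closed form V_N = \pi^{N/2} / \Gamma(N/2 + 1) (the formula whose derivation is deferred to the article linked above):

```python
from math import gamma, pi

def unit_ball_volume(dimension: int) -> float:
    """Hyper volume of the hypersphere (ball) of radius 1 in the given dimension."""
    return pi ** (dimension / 2) / gamma(dimension / 2 + 1)

for dimension in range(1, 29):
    print(f"{dimension:>2}  {unit_ball_volume(dimension):.9f}")

# The volume peaks around dimension 5 (~5.2638) and then decays towards zero.
```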

It is unintuitive because in 2 and 3 dimensions (the only dimensions in which we can visualize an hypersphere), the hypersphere pretty much fills its embedding cube.  A way to “visualize” what’s happening in higher dimension is to consider a “diagonal” into an hypersphere.

For a circle, the diagonal (i.e. at 45°) intersects with the unit circle at

(\frac {1} {\sqrt {2}}, \frac {1} {\sqrt {2}}) since (\frac {1} {\sqrt {2}})^2 + (\frac {1} {\sqrt {2}})^2 = 1^2

In general, at dimension N, the diagonal intersects at

x_i = \frac {1} {\sqrt {N}}

So, although the hypersphere of radius 1 touches the cube of side 2 centered at the origin in the middle of each of its walls, along the diagonals the surface of the hypersphere gets further and further away from the cube's corners as the dimension increases.
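
To put a number on that gap:  the corner of the cube is the point (1, 1, \ldots, 1), which sits at distance \sqrt{N} from the origin, while the hypersphere's surface always sits at distance 1.  The gap \sqrt{N} - 1 is about 0.41 in dimension 2, 0.73 in dimension 3 and already about 3.5 in dimension 20.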

Consequences of the hypersphere volume

A straightforward consequence of the hypersphere volume is sampling.  Randomly sampling a square of side 2 centered at the origin will land points within the unit circle with probability \frac{\pi}{4} \approx 79\%.  The same process in dimension 8 (a cube of side 2 and its inscribed hypersphere) would hit the inside of the hypersphere with a probability of about 1.6%.
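
To make that drop-off concrete, here is a small Monte Carlo sketch (numpy; the sample count is an arbitrary choice):  it draws uniform points in the cube of side 2 and counts how many land inside the unit hypersphere.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000  # arbitrary sample size

for dimension in (2, 3, 8):
    # Uniform points in the cube [-1, 1]^dimension, centered at the origin.
    points = rng.uniform(-1.0, 1.0, size=(n_samples, dimension))
    inside = np.sum(points ** 2, axis=1) <= 1.0
    print(f"dimension {dimension}: fraction inside the unit hypersphere = {inside.mean():.4f}")

# Roughly 0.785 (= pi / 4) in dimension 2, 0.524 in dimension 3, 0.016 in dimension 8.
```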

A corollary of the hypersphere volume is that at higher dimensions, the bulk of the volume of the hypersphere is concentrated in a thin annulus just below its surface.  An obvious consequence of that is that optimizing a metric (i.e. a distance) in high dimension is difficult.
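
The annulus claim can be checked directly, since the volume of a hypersphere scales as R^N:  the fraction of the volume lying within a shell of relative thickness \epsilon below the surface is 1 - (1 - \epsilon)^N.  A minimal sketch:

```python
# Fraction of a unit hypersphere's volume lying in the thin shell
# between radius 1 - epsilon and radius 1 (volume scales as R ** N).
epsilon = 0.01  # a shell whose thickness is 1% of the radius

for dimension in (2, 10, 100, 1000):
    shell_fraction = 1 - (1 - epsilon) ** dimension
    print(f"dimension {dimension:>4}: {shell_fraction:.1%} of the volume is in the shell")

# About 2% in dimension 2, 63% in dimension 100, essentially 100% in dimension 1000.
```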

What should we do about it?

First step is to be aware of it.

A symptom of high dimensionality is under-sampling:  the space covered is so large that the number of sample points required to learn the underlying model is likely to exceed the size of the actual sample set.

The simplest solution is to avoid high dimensionality with some pre-processing.  For instance, if we have a priori knowledge of the domain, we might be able to combine dimensions together.  For example, in an IoT field with 10 000 sensors, for many reasons, including the curse of dimensionality, it wouldn't be a good idea to consider each sensor's input as an independent dimension.  It would be worth trying to aggregate sensor inputs by analyzing the data.

Summary

Some Machine Learning algorithms will be more sensitive to higher dimensionality than others but the curse of dimensionality affects most algorithms.

It is a problem to be aware of and we should be ready to mitigate it with some good feature engineering.

What is Statistics and why should you care?

Unless you graduated in art, chances are you did a course in Statistics.

Chances are you hated it.

Most people I know postponed that course until the end of their degree, didn’t understand much about it and hated it dearly.

I didn’t like it either and understood very little.

A few years later when I studied Machine Learning, I had to review Statistics on my own.  This is when I had the epiphany:  Wow!  This is actually not so complicated and can even be quite interesting!

There are two main reasons I hated my undergraduate course:

  • Examples were all around surveys:  I was studying physics at the time, I didn’t care about those
  • It was really geared towards a collection of recipes:  I love mathematics, elegant theories and understanding what I do, cheat sheets didn’t do it for me

I would like to share my epiphany from back then with you today.  Hopefully it will shed some light on the poorly understood topic of statistics.

This won’t be a deep dive in the science of statistics.  I want to explain what statistics is by capturing where it comes from and giving very simple examples.  I won’t make statisticians out of you today.  Sorry.

Layer Cake

I see statistics as a layer cake.  At the foundation we have combinatorics, then probability and finally at the top, we have statistics.

[Figure: the statistics layer cake, with combinatorics at the base, probability in the middle and statistics on top]

Therein lies, in my own opinion, one of the mistakes of most statistics courses:  they try to get statistics into your head without explaining the two colossi of science it is based on.

Try to explain calculus in 5 minutes to somebody who has never seen f(x) = mx + b and you won't enlighten them either.

So let’s walk the layer cake from the bottom to the top.

Combinatorics

Combinatorics:  branch of mathematics studying finite or countable discrete structures (Wikipedia).

Combinatorics is about counting stuff.  Counting elements in sets (cardinality) and then combining sets together.

You’ve done combinatorics, but you don’t remember, do you?  Let me jog your memory.

Ok, let’s say I have the set A = {1, 2, 3}.  How many pairs can I make with elements of A?  How many trios?  What if order is irrelevant and I just want to know the possible distinct pairs?

Yes, you’ve done that type of problem.  This is where you’ve learned a new meaning for the exclamation mark, i.e. the factorial:  n! = n x (n-1)! (with 0! = 1).
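
If you want to check the answers, a minimal sketch with Python's itertools enumerates them directly:

```python
from itertools import permutations, combinations

A = [1, 2, 3]

# Ordered pairs and trios (order matters): n! / (n - k)! of them.
print(list(permutations(A, 2)))  # 6 ordered pairs
print(list(permutations(A, 3)))  # 6 ordered trios

# Distinct pairs when order is irrelevant: n! / (k! (n - k)!) of them.
print(list(combinations(A, 2)))  # 3 pairs: (1, 2), (1, 3), (2, 3)
```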

Let’s start an example I’ll carry over in the rest of the article.  Let’s say I have a die with six faces.  The set of possible outcomes if I throw it is D = {1, 2, 3, 4, 5, 6}.

How many elements in D?  Six.  Not too hard?  Well, the point isn’t to be hard here.

You can get into quite complicated problems in combinatorics.  I remember an exam question where we had drawers filled with an infinite supply of marbles of different colours and we had to mix them together…  that was quite fun.

A good example is the Rubik’s cube (from the Hungarian mathematician Ernő Rubik).  A Rubik’s cube has 6 faces, each having 9 squares:  6 colours, with 9 squares of each colour, 54 squares in total.  What is the number of possible configurations?  Are some configurations impossible given the physical constraints of the cube?
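
For the curious, the standard counting argument (which I won't derive here) answers the first question.  The sketch below multiplies the corner and edge arrangements, then divides by 2 for the parity constraint imposed by the physical cube:

```python
from math import factorial

# Standard counting argument for the 3x3x3 Rubik's cube.
corner_positions = factorial(8)   # arrangements of the 8 corner pieces
corner_orientations = 3 ** 7      # last corner's twist is determined by the others
edge_positions = factorial(12)    # arrangements of the 12 edge pieces
edge_orientations = 2 ** 11       # last edge's flip is determined by the others

# Divide by 2: corner and edge permutations must have the same parity.
configurations = corner_positions * corner_orientations * edge_positions * edge_orientations // 2
print(f"{configurations:,}")  # 43,252,003,274,489,856,000
```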

Probability

Probability:  measure of the likelihood that an event will occur.  Probability is quantified as a number between 0 and 1 where 0 indicates impossibility and 1 indicates certainty. (Wikipedia)

The canonical example is the toss of a coin.  Head is 1/2 ; tail also is 1/2.

What is an event?  An event is an element of the set of all possible events.  An event’s occurrence is random.

A special category of event is of great interest:  equipossible events.  Those are events which all have the same chance of occurrence.  My coin tossing is like that.  So is my 6-faced die…  if it hasn’t been tampered with.

For those, we have a direct link with combinatorics:

P(event) = \frac{1}{\#events}

The probability of an event is one over the number of possible events.  Let’s come back to my die example:

P(1) =\frac{1}{\#D}=\frac{1}{6}

The probability to get a 1 is 1/6 since #D, the cardinality of D (the number of elements in D), is 6.  Same for all the other events, i.e. 2, 3, 4, 5, 6.

So you see the link with combinatorics?  You always compare things that you’ve counted.

If events aren’t equipossible, you transform your problem until they are.  This is where all the fun resides.

Now, as with combinatorics, you can go to town with probability when you start combining events, conditioning events and…  getting into the objective versus subjective (Bayesian) interpretations.  But again, my goal isn’t to deep dive but just to illustrate what probability is.

Statistics

Statistics:  study of the collection, analysis, interpretation, presentation, and organization of data (Wikipedia).

I find that definition quite broad, as even plotting a graph becomes statistics.

I prefer to think of statistics as a special class of probability.  In probability we define models by defining sets of events and the likelihood of events to occur ; in statistics we take samples and check how likely they are according to the model.

A sample is simply a real world trial:  throwing a die, tossing a coin, picking an individual from a population, etc.

For instance, given my six-faced die, let’s say I throw it once and I get a 2.  How confident are we that my die is equipossible with 1/6 probability for each face?

Well…  it’s hard to tell, isn’t it?  A lot of probability models could have given that outcome.  Anything where 2 has a non-zero probability really.

What if I throw twice and get a 2 each time?  Well…  that is possible.  What about three 2s?  What about ten?  You’ll start to get skeptical about my die being equipossible as we go, won’t you?

Statistics allow you to quantify that.

When you hear about confidence interval around a survey, that’s exactly what that is about.

If I take my sequence of ‘2’ in my die throwing experiment, we can quantify it this way.  Throwing a ‘2’ has a probability of 1/6, which is the same probability as any other result.  Now, having the same result n times in a row has probability P_{same} = (\frac{1}{6})^{n-1}, while having a sequence where no throw repeats the previous one has probability P_{different} = (\frac{5}{6})^{n-1}.  So as n increases, it gets less and less likely to see a sequence of identical results.  You can set a threshold and decide whether you believe the underlying model or not ; in this case, whether my die is fair or not.
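
Here is a minimal sketch of those two probabilities for a few values of n:

```python
# Probability of throwing the same face n times in a row versus
# never repeating the previous face, with a fair six-sided die.
for n in (2, 3, 5, 10):
    p_same = (1 / 6) ** (n - 1)
    p_different = (5 / 6) ** (n - 1)
    print(f"n = {n:>2}: P(same) = {p_same:.6f}, P(no consecutive repeat) = {p_different:.3f}")
```

Already at n = 5, a run of identical results has a probability below one in a thousand under the fair-die model.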

Why should you care?

Statistics is at the core of many empirical sciences.  We use statistics to test theories, to test models.

Also, those three areas of mathematics (i.e. combinatorics, probability & statistics) spawn off into other theories.

For example, information theory is based on probability.  That theory in turn helps us understand signal processing & data compression.

Machine Learning can be interpreted as statistics.  In Machine Learning, we define a class of statistical models and then look at samples (the training set) to find the best model fitting that sample set.

Moreover, those areas of mathematics allow us to quantify observations.  This is key.  In Data Science, you take a volume of data and you try to make it talk.  Statistics help you do that.

When you take a data set and observe some characteristics (e.g. correlation, dependency, etc.), one of the first things you’ll want to validate is the good old “is it statistically significant?”  This is basically figuring out whether you could observe those characteristics by chance or whether they are a real characteristic of the data.  For instance, if I look at cars on the freeway and observe a blue car followed by a red car a few times, is that chance or are there enough occurrences to think there is a real pattern?

So if you are an executive, you should care about statistics to go beyond just looking at (visualizing) your business data and to understand, at least at a high level, what type of models and assumptions your data scientists are making about your data.  Are you training models to learn trends in your business?  If so, what do the models look like and how do they perform in terms of prediction?

If you are a developer / architect, you should care about statistics for two big reasons.  First, you are probably instrumental in deciding what type of data you collect from the application and at what frequency (e.g. telemetry).  If you log the number of users logged in only once a day, your data scientists will have a hard time extracting information from that data.  The second reason is that you are likely going to use data to display and, more and more, to have your application make decisions.  Try to understand the data and the models used for decision making.

We live in a world of data abundance.  Data is spewing from every device & every server.  It is easy to see features in the data that are simply noise, or to miss features because they aren’t visible when you visualize the data.  Statistics is the key to your data vault.

Summary

I hope my article was more insightful than the statistics classes you remember.

Basically, combinatorics studies countable sets.  Probability uses combinatorics to assign probabilities (values between 0 & 1) to events.  Statistics takes samples and compares them to probability models.

Those fields of study have massive influence in many other fields.  They are key in Machine Learning and Data Science in general.

Where is the statistics in Machine Learning?

I often try to explain what Machine Learning is to people outside the field.  I’m not always good at it but I am getting better.

One of the confusions I often run into when I start to elaborate on the details is the presence of statistics in Machine Learning.  For people outside the field, statistics is the stuff of surveys or of their mandatory science class (everyone I know outside of the Mathematics field who had a mandatory Statistics class in their first Uni year ended up postponing it until the end of their undergraduate studies!).

It feels a bit surprising that statistics would get involved while a computer is learning.

So let’s talk about it here.

 

Back in the day there were two big tracks of Artificial Intelligence.  One track was about codifying our knowledge and then running the code on a machine.  The other track was about letting the machine learn from the data ; that is Machine Learning in a nutshell.

The first approach has its merit and had a long string of successes such as Expert Systems based on logic and symbolic paradigms.

The best example of those I always give is in vaccine medicine.  The first time I went abroad to a tropical country, in 2000, I went to a vaccine clinic and met with a doctor.  The guy asked me where I was going, consulted a big reference manual, probably used quite a bit of his knowledge and prescribed 3 vaccines his nurse administered to me.  In 2003, when I came back to Canada, I had a similar experience.  In 2009…  nope.  Then I met with a nurse.  She sat behind a computer and asked me similar questions, punching the answers into the software.  The software then prescribed me 7 vaccines to take (yes, it was a more dangerous place to go and I got sick anyway, but that’s another story).

That was an expert system.  Somebody sat with experts and asked them how they proceeded when making vaccine prescriptions.  They took a lot of notes and codified this into a complex software program.

 

Expert systems are great at systematizing repetitive tasks requiring a lot of knowledge that can ultimately be described.

If I asked you to describe to me how you differentiate a male from a female face while looking at a photograph, it might get quite hard to codify.  Actually, if you could answer me at all, you would probably give a bad recipe, since you do that unconsciously and you aren’t aware of most of what your brain does to make the decision.

For those types of problems, Machine Learning tends to do better.

A great example of this is how much better Machine Learning is at solving translation problems.  A turning point was the use of Canadian Hansards (transcripts of Parliamentary Debates in both French & English) to train machines how to translate between the two languages.

Some people, e.g. Chomsky, oppose Machine Learning as it actually hides the codification of knowledge.  Actually, there is an interesting field developing to use machines to build explanations out of complicated mathematical proofs.

 

Nevertheless, why is there statistics in Machine Learning?

When I wrote posts about Machine Learning to explain the basics, I gave an example of linear regression on a 2D data set.

I glossed over the fact that the data, although vaguely linear, wasn’t a perfect line, far from it.  Nevertheless, the model we used was a line.

This is where a statistical interpretation can come into play.

Basically, we can interpret the data in many ways.  We could say that the data is linear but there were errors in measurement.  We could say that there are many more variables involved that aren’t part of the dataset, but we’ll consider those as random since we do not know them.

Both explanations link to statistics and probability.  In the first one, we suppose the errors are randomly distributed and explain the deviation from a linear relationship.  In the second, we assume missing data (variables) that, if present, would make the dataset deterministic, but in their absence we model them as a random distribution.

Actually, there is a long-held debate in physics around the interpretation of Quantum Mechanics that opposes those two explanations.

 

In our case, the important point is that when we build a machine learning model out of a data set, we assume our model will predict the expectation (or average if you will) of an underlying distribution.
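
To illustrate that statistical reading, here is a minimal sketch (numpy; the true slope, intercept and noise level are made-up values):  a line fitted on noisy samples recovers, approximately, the expectation of the underlying distribution rather than any individual sample.

```python
import numpy as np

rng = np.random.default_rng(7)

# Underlying "true" model (made-up values): y = 2x + 1 plus random noise.
x = rng.uniform(0, 10, size=500)
noise = rng.normal(0, 2, size=500)
y = 2 * x + 1 + noise

# Fit a line (degree-1 polynomial) to the noisy samples.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted slope = {slope:.2f}, fitted intercept = {intercept:.2f}")
# The fit lands close to the true (2, 1): it estimates the expectation
# of y given x, not the noisy individual samples.
```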

 

Machine Learning & Statistics are two sides of the same coin.  Unfortunately, they were born from quite different scientific disciplines and have different cultures and vocabularies that sustain confusion to this day.  A good article explaining those differences can be found here.