I previously talked about the curse of dimensionality (more than 2 years ago) related to Machine Learning.
Here I wanted to discuss it in more depth and dive into the mathematics of it.
High dimensions might sound like Physics’ string theory where our universe is made of more than 4 dimensions. This isn’t what we are talking about here.
The curse of dimensionality is related to what happens when a model deals with a data space with dimensions in the hundreds or thousands.
As the title of this article suggests, we’re going to take the angle of the properties of Hyperspheres (spheres in N dimensions) to explore high dimension properties.
This article is inspired by Foundations of Data Science by John Hopcroft & Ravindran Kannan (chapter 2).
Why should I care about High Dimension?
When introducing Machine Learning concepts, we typically use few dimensions in order to help visualization. For instance, when I introduced linear regression or polynomial regression in past articles, I used datasets in two dimensions and plotted them on a chart.
In the real world, typical data sets have much more dimensions.
A typical case of high dimension is image recognition (or character recognition as a sub-category), where even a low-resolution picture has hundreds of pixels. The corresponding model would take a gray-scale input vector of dimension 100+.
With fraud detection, transactions contain not only the value of the transaction, but the time of day, day of week, geo-location, type of commerce, type of products, etc. This might or might not be a high-dimension problem, depending on the available data.
In an e-commerce web site, a Product recommendation algorithm could be as simple as an N x N matrix of 0 to 1 values where N is the number of products.
With IoT, multiple sensors feed a prediction model.
In bioinformatics, DNA sequencing generates a huge amount of data which often is arranged in high dimensional model.
Basically, high dimensions crop up everywhere.
What happens as dimension increases?
For starters, a space with more dimensions simply is… bigger. In order to sample a space with 2 dimensions at a resolution of 10 units, we need 10^2 = 100 points. The same sampling in a space of dimension 3 requires 10^3 = 1000 points. Dimension 20? That requires 10^20 = 100 000 000 000 000 000 000 points.
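The exponential growth above is easy to check. A minimal sketch (the function name is mine, not from the article):

```python
# Number of grid points needed to sample a space at a fixed
# resolution per axis: resolution ** dimension, i.e. exponential.
def points_needed(dim: int, resolution: int = 10) -> int:
    """Grid points required to cover a `dim`-dimensional space
    with `resolution` samples along each axis."""
    return resolution ** dim

for d in (2, 3, 20):
    print(f"dimension {d:2d}: {points_needed(d):,} points")
```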
Right off the bat we can tell that sampling a space of dimension 2 or 3 is realistic, while for a space of dimension 20 it isn't. Hence we are likely to suffer from under-sampling.
Yoshua Bengio has a nice discussion about Curse of Dimensionality here.
Hypersphere in a cube
Beyond sampling problems, metrics & measures change behaviour in high dimensions. Intuitively this makes sense, since a measure takes one or more vectors and squeezes them into a single numerical value; the higher the dimension, the more data we squeeze into one number, and hence the more information we should expect to lose.
We use metrics & measures heavily in Machine Learning. For instance, a lot of cost (or loss) functions are based on the Euclidean distance:

d(x, y) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 + … + (x_N - y_N)^2 )
Now if x and / or y are random variables (e.g. samples), the law of large numbers applies when N becomes large: the sum of squared differences trends toward N times its expected value, with a narrower relative standard deviation as N increases. In turn, this means there is less and less information in the distance as the number of dimensions increases: distances between random pairs of points become nearly identical.
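This concentration of distances can be observed empirically. A small simulation (my own illustration, not from the article) draws random pairs of points from the unit cube and measures how much their distances spread relative to their mean:

```python
import math
import random

random.seed(0)

def relative_std_of_distances(dim: int, n_pairs: int = 1000) -> float:
    """Std-dev / mean of Euclidean distances between random
    pairs of points drawn uniformly from the unit cube [0, 1]^dim."""
    dists = []
    for _ in range(n_pairs):
        x = [random.random() for _ in range(dim)]
        y = [random.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean

# The relative spread shrinks as the dimension grows: in high
# dimension, all pairwise distances look nearly the same.
for dim in (2, 10, 100, 1000):
    print(f"dim {dim:4d}: relative spread = {relative_std_of_distances(dim):.3f}")
```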
This brings us to the hypersphere. A hypersphere's equation is

x_1^2 + x_2^2 + … + x_N^2 = R^2

where x = (x_1, …, x_N) is a point of dimension N and R is the radius of the hypersphere.
A hypersphere of dimension 1 is a line segment, a hypersphere of dimension 2 is a circle, dimension 3 is a sphere, dimension 4 is an… expanding universe? and so on.
A theorem I’ll demonstrate in a future article is that the volume of a hypersphere of radius 1 tends to zero as the dimension increases.
UPDATE (12-07-2017): Demonstration of hypersphere hyper volume is done in this article.
This is fairly unintuitive, so let me give real numerical values for the volume of the radius-1 hypersphere: 2.00 (dimension 1), 3.14 (2), 4.19 (3), 4.93 (4), 5.26 (5), 5.17 (6), 4.72 (7), 4.06 (8), 3.30 (9), 2.55 (10), 0.026 (20).
If we plot those values:
We see the hypervolume increases over the first few dimensions. A circle of radius 1 has an area of pi (3.1416) while a sphere of radius 1 has a volume of 4.19. The volume peaks at dimension 5 and then shrinks toward zero.
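These values can be computed with the closed-form formula for the volume of the N-dimensional unit ball, V_N = pi^(N/2) / Gamma(N/2 + 1) (the formula the future article demonstrates; here I just evaluate it):

```python
import math

def unit_ball_volume(n: int) -> float:
    """Volume of the n-dimensional ball of radius 1:
    V_n = pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

# Rises until dimension 5, then shrinks toward zero.
for n in range(1, 11):
    print(f"n = {n:2d}: V = {unit_ball_volume(n):.4f}")
```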
It is unintuitive because in 2 and 3 dimensions (the only dimensions in which we can visualize a hypersphere), the hypersphere pretty much fills its embedding cube. A way to “visualize” what’s happening in higher dimensions is to consider a “diagonal” into a hypersphere.
For a circle, the diagonal (i.e. at 45°) intersects the unit circle at

(1/√2, 1/√2) ≈ (0.707, 0.707)

In general, at dimension N, the diagonal intersects the unit hypersphere at

(1/√N, 1/√N, …, 1/√N)
So, although the hypersphere of radius 1 touches each face of the cube of side 2 centered at the origin, along the diagonal its surface stays at distance 1 from the origin while the cube's corner sits at distance √N. The corners thus get further and further away from the hypersphere's surface as the dimension increases.
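A few lines of arithmetic make the gap concrete (my own illustration):

```python
import math

# Along the main diagonal, the unit hypersphere's surface sits at
# coordinate 1/sqrt(N) per axis, while the cube's corner sits at 1
# per axis, i.e. at distance sqrt(N) from the origin. The gap
# sqrt(N) - 1 between sphere and corner grows without bound.
for n in (2, 3, 10, 100):
    coord = 1 / math.sqrt(n)      # diagonal intersection, per axis
    corner_dist = math.sqrt(n)    # distance from origin to corner
    print(f"N={n:3d}: intersection coord = {coord:.3f}, "
          f"corner at distance {corner_dist:.2f}")
```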
Consequences of the hypersphere volume
A straightforward consequence of the hypersphere volume is sampling. Randomly sampling a square of side 2 centered at the origin lands points within the unit circle with probability π/4 ≈ 78.5%. The same process with a hypersphere of dimension 8 would hit the inside of the hypersphere with a probability of only 1.6%.
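A quick Monte Carlo simulation (my own sketch, not from the article) confirms those probabilities:

```python
import random

random.seed(42)

def hit_probability(dim: int, n_samples: int = 100_000) -> float:
    """Fraction of uniform samples from the cube [-1, 1]^dim that
    land inside the unit hypersphere (||point|| <= 1)."""
    hits = 0
    for _ in range(n_samples):
        point = [random.uniform(-1, 1) for _ in range(dim)]
        if sum(c * c for c in point) <= 1.0:
            hits += 1
    return hits / n_samples

print(f"dim 2: {hit_probability(2):.3f}")   # close to pi/4 ~ 0.785
print(f"dim 8: {hit_probability(8):.4f}")   # close to 0.016
```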
A corollary of the hypersphere volume result is that at higher dimensions, the bulk of the volume of the hypersphere is concentrated in a thin annulus just below its surface. An obvious consequence is that optimizing a metric (i.e. a distance) in high dimension is difficult: most points end up at nearly the same distance.
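The annulus claim follows directly from the fact that volume scales as r^N, so the fraction of a unit ball's volume within a shell of thickness eps below the surface is 1 − (1 − eps)^N. A short illustration:

```python
# Fraction of a unit ball's volume lying in the thin shell between
# radius (1 - eps) and 1. Since volume scales as r^N, the fraction
# is 1 - (1 - eps)^N, which approaches 1 as N grows.
eps = 0.05
for n in (2, 10, 50, 100):
    fraction = 1 - (1 - eps) ** n
    print(f"N={n:3d}: {fraction:.1%} of the volume is within "
          f"{eps:.0%} of the surface")
```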
What should we do about it?
First step is to be aware of it.
A symptom of high dimensionality is under-sampling: the space to cover is so large that the number of sample points required to learn the underlying model is likely to exceed the size of the actual sample set.
The simplest solution is to avoid high dimensionality with some pre-processing. For instance, if we have a priori knowledge of the domain, we might be able to combine dimensions together. For example, in an IoT field with 10 000 sensors, for many reasons, including the curse of dimensionality, it wouldn't be a good idea to treat each sensor's input as an independent input. It would be worth trying to aggregate sensor inputs by analyzing the data.
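As a minimal sketch of that idea, here is a toy aggregation of 10 000 hypothetical sensor readings into per-zone averages. The number of zones and the grouping rule are assumptions for illustration only; in a real deployment the grouping would come from domain knowledge (sensor location, type, correlation analysis, ...):

```python
import random

random.seed(1)
N_SENSORS = 10_000
ZONES = 100  # assumed number of physical zones (hypothetical)

# Hypothetical raw readings, e.g. temperatures around 20 degrees.
readings = [random.gauss(20.0, 2.0) for _ in range(N_SENSORS)]

def aggregate_by_zone(values, n_zones):
    """Average consecutive blocks of sensors into one feature per zone,
    reducing the input dimension from len(values) to n_zones."""
    block = len(values) // n_zones
    return [sum(values[i * block:(i + 1) * block]) / block
            for i in range(n_zones)]

features = aggregate_by_zone(readings, ZONES)
print(f"{N_SENSORS} raw inputs reduced to {len(features)} features")
```

More principled reductions (PCA, autoencoders, domain-specific features) follow the same pattern: replace many correlated inputs with a handful of informative ones.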
Some Machine Learning algorithms will be more sensitive to higher dimensionality than others but the curse of dimensionality affects most algorithms.
It is a problem to be aware of and we should be ready to mitigate it with some good feature engineering.