### Visualizing the True Extent of the Curse of Dimensionality

#### Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features

Think of a dataset made of some number of observations, each observation having N features. If you convert all features to a numeric representation, you can say that each observation is a point in an N-dimensional space.

When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large; this could happen, for example, if you're creating many features via one-hot encoding. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow bigger than you would expect.

The phenomenon is real. As the number of dimensions N grows, and all else stays the same, the N-volume containing your observations really does increase in a sense (or at least the number of degrees of freedom becomes larger), and the Euclidean distances between observations also increase. The group of points actually does become more sparse. This is the geometric basis for the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be influenced as a consequence of these changes.

Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. grid search) becomes less efficient: you need more trials to cover the same intervals along any axis. A subtler effect impacts any model based on distance or vicinity: **as the number of dimensions grows to some very large values, if you consider any point in your observations, all the other points appear to be far away and somehow nearly equidistant.** Since these models rely on distance to do their job, the leveling out of distance differences makes their job much harder. Clustering, for example, does not work as well if all points appear to be nearly equidistant.

For all these reasons, and more, techniques such as PCA and LDA have been created, in an effort to move away from the peculiar geometry of spaces with very many dimensions, and to distill the dataset down to a number of dimensions more compatible with the actual information contained in it.

It is hard to perceive intuitively the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely challenging to visualize, so let's do some simple 2D visualizations to help our intuition. There is a geometric basis for the reason why dimensionality can become a problem, and this is what we will visualize here. If you have not seen this before, the results might be surprising: the geometry of high-dimensional spaces is far more complex than the typical intuition is likely to suggest.

### Volume

Consider a square of size 1, centered at the origin. In the square, you inscribe a circle.

a circle inscribed in a square

That is the setup in 2 dimensions. Now think in the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we will refer to these objects as "sphere" and "cube", no matter how many dimensions they have.

The volume of the cube is fixed, it's always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?

Let's answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points, distributed uniformly at random within the cube. For each point we calculate its distance to the origin; if that distance is less than 0.5 (the radius of the sphere), the point is inside the sphere.

random points

If we divide the number of points inside the sphere by the total number of points, that will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation gets better when the total number of points is large.

In other words, the ratio `inside_points / total_points` will approximate the volume of the sphere.

The code is rather straightforward. Since we need many points, explicit loops must be avoided. We could use NumPy, but it's CPU-only and single-threaded, so it would be slow. Potential alternatives: CuPy (GPU), JAX (CPU or GPU), PyTorch (CPU or GPU), etc. We will use PyTorch, but the NumPy code would look almost identical.

If you follow the nested torch statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The `ratio` array will end up containing the volume of the sphere in different numbers of dimensions.

The tunable parameters are set for a GPU with 24 GB of memory; adjust them if your hardware is different.

```python
import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )
    # clean up memory
    torch.cuda.empty_cache()
```

Let's visualize the results:

the volume of the sphere

N=1 is trivial: both the "sphere" and the "cube" are just linear segments, and they are equal, so the "volume" of the "sphere" is 1. For N=2 and N=3, you should remember the results from high school geometry, and notice that this simulation is very close to the exact values.

But as N increases, something very dramatic happens: the volume of the sphere drops like a stone. It is already close to 0 when N=10, and it falls below the floating point precision of this simulation shortly after N=20 or so. Beyond that point, the code calculates the volume of the sphere to be 0.0 (it's just an approximation).

```python
>>> print(list(ratio))
[1.0, 0.78548005, 0.52364381, 0.30841056, 0.16450286, 0.08075666, 0.03688062, 0.015852, 0.00645304, 0.00249584, 0.00092725, 0.00032921, 0.00011305, 3.766e-05, 1.14e-05, 3.29e-06, 9.9e-07, 2.8e-07, 8e-08, 3e-08, 2e-08, 0.0]
```
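As a sanity check (not part of the original notebook), the simulated values can be compared with the closed-form volume of an N-ball of radius r, which is pi^(N/2) / Gamma(N/2 + 1) * r^N. A minimal sketch using only the standard library:

```python
import math

def sphere_volume(n, r=0.5):
    """Exact volume of an n-ball of radius r: pi^(n/2) / Gamma(n/2 + 1) * r^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

print(sphere_volume(1))   # 1.0: a segment of length 1
print(sphere_volume(2))   # pi/4, approximately 0.7854
print(sphere_volume(3))   # pi/6, approximately 0.5236
print(sphere_volume(20))  # already down to roughly 2.5e-8
```

The exact values agree with the Monte Carlo estimates above to within the simulation's precision.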

For large values of N, almost the whole volume of the cube is located in the corners. The central region, containing the sphere, becomes insignificant in comparison. In high-dimensional spaces, the corners become very important, and most of the volume "migrates" into them. This happens extremely fast as N rises.

Another way to look at it: the sphere is the "vicinity" of the origin. **As N increases, the points simply disappear from that vicinity. The origin is not really all that special, so what happens to its vicinity must happen to all vicinities everywhere.** The very notion of "vicinity" undergoes a big change. Neighborhoods empty out as N rises.

I first ran this simulation years ago, and I vividly remember how shocking the result was: how quickly the volume of the sphere dive-bombs toward 0 as N rises. After checking the code and not finding any errors in it, my reaction was: save the work, lock the screen, go outside, take a walk, and think about what you've just seen. 🙂

### Linear distance

Let's calculate the diagonal of the cube as a function of N. This is trivial, since the diagonal is literally `sqrt(N)`, so the code is too simple to include here. And these are the results:
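For completeness, a one-line version of that computation (a sketch, not taken from the article's notebook; the upper limit of 2000 dimensions is an assumption matching the plots below) might look like:

```python
import numpy as np

# assumed upper limit for the plot
d_max_diag = 2000

# the diagonal of the unit cube in N dimensions is sqrt(1^2 + 1^2 + ... + 1^2) = sqrt(N)
diag = np.sqrt(np.arange(1, d_max_diag + 1))

print(diag[1], diag[2])  # sqrt(2) and sqrt(3): the square and cube diagonals
print(diag[999])         # roughly 31.62 for N=1000
```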

length of diagonal

Again, for N=2 and N=3 you should recognize the results from geometry (the diagonal of the square and of the normal 3-cube). But as N increases, the diagonal also increases, and its growth is unbounded.

This may sound very counterintuitive, but even as the size of the edge of the cube remains constant (and equal to 1), its diagonal can grow arbitrarily large as you increase N. Already for N=1000, the diagonal length is around 32. If the edge of the cube is 1 meter, then in a space with N = 10^6 dimensions the diagonal is sqrt(10^6) = 1000 meters: the diagonal of the cube will be 1 kilometer!

**Even as the distance along the edge remains the same, the diagonal just keeps growing with the number of dimensions.**

Every time a new dimension is added to the space, more edges, faces, etc. pop into existence, the configuration of the corner region becomes more complex, gaining more degrees of freedom, and the diagonal measures a bit longer.

### Distances between observations

How about distances between observations or points? Let's say we generate a fixed number of random points, and then we calculate the average distance between any two points, and the average distance to the nearest neighbor of any point. All points are always contained in the cube. What happens to the average distances as N increases? Let's run another simulation.

Please note the conservative approach to memory management. This is important, since the data structures used here are quite large.

```python
n_points_d = 10**3
# how many pairs of points are there
dist_count = n_points_d * (n_points_d - 1) / 2
# we use the full pair-wise matrix of distances,
# so each distance will be counted twice
dist_count = 2 * dist_count

# same maximum dimension count as in the diagonal code (not shown)
d_max = d_max_diag
avg_dist = np.zeros(d_max)
avg_dist_nn = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # generate random points
    point_coordinates = torch.rand((n_points_d, d), device=device)
    # compute differences of point coordinates on all axes
    coord_diffs = point_coordinates.unsqueeze(1) - point_coordinates
    del point_coordinates
    # square the coordinate differences
    diffs_squared = torch.pow(coord_diffs, 2)
    del coord_diffs
    # compute distances between any 2 points
    distances_full = torch.sqrt(torch.sum(diffs_squared, dim=2))
    del diffs_squared
    # compute average distance between points
    avg_dist[d - 1] = torch.sum(distances_full).item() / dist_count
    # compute distances to nearest neighbors:
    # overwrite the zero self-distances with a value larger than the
    # cube diagonal, so they are never picked as the minimum
    distances_full[distances_full == 0.0] = np.sqrt(d) + 1
    distances_nn, _ = torch.min(distances_full, dim=0)
    del distances_full
    # compute average distance to nearest neighbors
    avg_dist_nn[d - 1] = torch.mean(distances_nn).item()
    del distances_nn
    torch.cuda.empty_cache()
```

We use far fewer points here (only 1000) because the size of the main data structures increases with N^2, so we would run out of memory very quickly if we generated too many points. Even so, the approximate results should be close enough to the actual values.

This is the average distance between any two points, out of 1000, as a function of N:

average distance between points

For small values of N, the average distance is something like 0.5, or 1. But as N increases, the average distance starts growing, and it's already close to 18 when N=2000. The increase is substantial.

This is the average distance to the nearest neighbor, as a function of N:

average distance to the nearest neighbor

The increase here is even more dramatic: the values are quite tiny when N is below 10, since the points are crowded together in a low-dimensional space. For large values of N, the average distance to the nearest neighbor actually gets close to the average distance between any two points; in other words, the emptiness of the immediate vicinity dominates the measurements.

This is why we said at the beginning that **points become nearly equidistant in high-dimensional spaces**: average distances and shortest distances become nearly the same. Here you have proof of it from the simulation, and an intuition for the strength of the phenomenon.
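A quick way to quantify this leveling out (a minimal NumPy sketch, separate from the notebook code above, with the point counts chosen arbitrarily) is to compare the mean nearest-neighbor distance with the mean pairwise distance. Their ratio is small in low dimensions and climbs toward 1 as N grows:

```python
import numpy as np

def distance_ratio(n_dims, n_points=200, seed=0):
    """Mean nearest-neighbor distance divided by mean pairwise distance."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    # full pairwise distance matrix via broadcasting
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # exclude self-distances when looking for nearest neighbors
    np.fill_diagonal(dists, np.inf)
    nn_mean = dists.min(axis=1).mean()
    pair_mean = dists[np.isfinite(dists)].mean()
    return nn_mean / pair_mean

# small ratio in 2 dimensions, close to 1 in 1000 dimensions
print(distance_ratio(2), distance_ratio(1000))
```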

As the number of dimensions N increases, all points recede from each other, even though their coordinates are confined to the same narrow interval (of width 1) along each axis. The group of points effectively becomes more and more sparse simply through the creation of more dimensions. The increase spans a couple of orders of magnitude as the number of dimensions grows from a few to a few thousand. It's a very large increase.

### Conclusions

The curse of dimensionality is a real phenomenon. All else being equal, as you increase the number of dimensions (or features), the points (or observations) move away from each other quickly. There's effectively more room in there, even though linear intervals along axes are the same. All vicinities empty out, and each point ends up sitting alone in a high-dimensional space.

This should provide some intuition for the fact that some techniques (clustering, various models, etc.) may behave differently when the number of features changes substantially. Few observations, combined with a large number of features, may lead to sub-optimal results, and while the curse of dimensionality is not the only reason for that, it can be one of the reasons.

Generally speaking, the number of observations should "keep up" with the number of features; the exact rules here depend on many factors.

If nothing else, this article should provide an intuitive overview of the properties of high-dimensional spaces, which are difficult to visualize. Some of the trends are shockingly strong, and now you should have some idea of just how strong they are.

All code used in this article is here:

misc/curse_dimensionality_article/n_sphere.ipynb at master · FlorinAndrei/misc

All images used in this article are created by the author (see the code link above).

This article was originally published in Towards Data Science on Medium.