I’m trying to demonstrate the curse of dimensionality in Python. Here’s what the curve should look like:
I want to do this for dimensions from 2 to 30 with a step size of 1, and for each dimension I want to generate 100 random data points. Can anyone tell me how to go about this? Here’s my current code:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    import os
    import math
    from random import randint
Below, I assign random values to the variables x and y and then calculate the Euclidean distance between them using np.linalg.norm. After that, I store the value in a list.
    x = np.array([randint(0, 101), randint(0, 101)])
    y = np.array([randint(0, 101), randint(0, 101)])
    ed_list = []
    d = np.array([1, 2])
    ed = np.linalg.norm(x - y)
    ed_list.append(ed)
    ed_list
    print(x)
Now, I use a for loop to generate more random values, append them as extra dimensions, and then store and plot the distances.
    for i in range(2, 50):
        xval = randint(0, 101)
        yval = randint(0, 101)
        x = np.append(x, xval)
        y = np.append(y, yval)
        d = np.append(d, i + 1)
        try:
            ed = np.linalg.norm(x - y)
            ed_list.append(ed)
            ed = 0
        except:
            ed = 0
    print(x, y)
    plt.plot(ed_list)
    plt.xlabel('Number of dimensions')
    plt.ylabel('Euclidean Distance')
    plt.show()
However, my curve looks like this because I’m just plotting the Euclidean distance and not calculating according to the formula in the above Curse of Dimensionality picture.
So I have 2 questions. How can I add the formula that is given in the curse of dimensionality picture? Also, how do I do this for dimensions 2 to 30 with 100 random data points, as I stated above? If anyone can help me with this, that would be great!
All you are showing is the distance between two points as you add extra dimensions. Naturally, that will keep growing larger and larger. The curse of dimensionality says that, given a whole set of points, P, and a reference point, Q, if you compute the distance from Q to each of the points in P, the difference between the maximum distance and the minimum distance shrinks toward 0 relative to the minimum distance as the number of dimensions grows. You need WAY more than 2 points to show that.
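In symbols, this is usually written as a relative contrast. The notation below is an assumption, since the picture from the question is not reproduced here: d is the number of dimensions, and dist_max and dist_min are the largest and smallest distances from Q to the points in P. Under broad conditions (for example, independent and identically distributed coordinates), the ratio tends to 0:

```latex
\lim_{d \to \infty}
  \frac{\mathrm{dist}_{\max} - \mathrm{dist}_{\min}}{\mathrm{dist}_{\min}} = 0
```

Plots of this effect often show the base-10 logarithm of that ratio on the y-axis, so the curve falls as d grows.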
For each dimension, generate a bunch of points (I picked 10000), pick another random point, compute the distance to each, grab the min and max from those distances, and plot the result.
    import math

    import numpy as np
    import matplotlib.pyplot as plt

    deltas = []
    for N in range(2, 50):
        # Generate 10000 random points in N dimensions.
        P = [np.random.randint(-100, 100, N) for _ in range(10000)]
        # Pick one more random point as the reference.
        Q = np.random.randint(-100, 100, N)
        # Distance from Q to every point in P.
        diffs = [np.linalg.norm(p - Q) for p in P]
        mxd = max(diffs)
        mnd = min(diffs)
        # log10 of the spread (max - min), divided by the minimum distance.
        delta = math.log10(mxd - mnd) / mnd
        deltas.append(delta)

    plt.plot(range(2, 50), deltas)
    plt.xlabel('Number of dimensions')
    plt.ylabel('Euclidean Distance')
    plt.show()
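For the sizes asked about in the question (dimensions 2 to 30, 100 points each), here is a minimal sketch. It assumes the formula in the picture is log10((dist_max - dist_min) / dist_min), which is not the same expression as the one computed above. The function names, the `trials` averaging, and the choice of uniform points in the unit cube are all hypothetical additions, and only the standard library is used so that it runs without NumPy (`math.dist` needs Python 3.8 or later).

```python
import math
import random


def relative_contrast(dim, n_points=100, rng=None):
    """Return log10((d_max - d_min) / d_min) for one random data set."""
    if rng is None:
        rng = random.Random()
    # Reference point Q, drawn uniformly from the unit cube (an assumption).
    q = [rng.random() for _ in range(dim)]
    # Distance from Q to each of n_points freshly drawn random points.
    dists = [
        math.dist(q, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    d_min, d_max = min(dists), max(dists)
    return math.log10((d_max - d_min) / d_min)


def contrast_curve(dims, n_points=100, trials=20, seed=0):
    """Average the contrast over several trials for each dimension."""
    rng = random.Random(seed)
    return [
        sum(relative_contrast(d, n_points, rng) for _ in range(trials)) / trials
        for d in dims
    ]


dims = list(range(2, 31))  # dimensions 2 to 30, step 1
curve = contrast_curve(dims)
```

`curve` can then be drawn with `plt.plot(dims, curve)`. With only 100 points a single run is noisy, which is why the sketch averages over several trials; setting `trials=1` gives the raw single-run curve.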