Unexpected slow iteration over DataFrame rows

I want to compare all lines of a csv file against each other. What I want to do is to get the squared sum of the differences between each two rows:

np.sum((row_i-row_j)**2)

I did this using python csv library and it was ok, but I want to use pandas and for some reason, pandas is extremely slow.

My code using csv:

with open(filename) as file:
    reader = csv.reader(file, delimiter=' ', skipinitialspace=True)
    header = reader.__next__()
    table = list(reader)

for i in range(len(table)-1):
    row_i = np.array(table[i][2:-2], dtype='double')
    for j in range(i+1, len(table)):
        row_j = np.array(table[j][2:-2], dtype='double')
        u = np.sum((row_i - row_j)**2)

My code using pandas:

df = pd.read_csv(filename, delim_whitespace=True)
lines_to_remove2=set()
for i in range(len(df)-1):
    row_i = df.iloc[i][2:-2].values
    for j in range(i+1,len(df)):
        row_j = df.iloc[j][2:-2].values
        u = np.sum((row_i - row_j)**2)

As you can see I’m not doing any comparation yet, I’m only assigning values for each row. I will further use the value of u to decide which lines will be deleted. My file has 480 rows which means there are 480*479 iterations, which is not so much. When I iterate over a list, it takes me less than 1s to iterate over all rows, but it takes me many seconds to do it with a pandas.DataFrame. Am I doing something wrong? Is there a faster way of doing this in pandas?

Answer

Your code runs very slow, because you used 2 nested loops and access each row of df by its index.

To get printouts of reasonable size, I used the source DataFrame with only 5 rows:

     A    B
0  1.0  3.5
1  2.0  3.6
2  3.0  4.7
3  4.0  6.8
4  5.0  7.3

Another simplification is that I compute the result for whole rows, instead of “[2:-2]” part.

To get the result of your code, I ran:

for i in range(len(df)-1):
    row_i = df.iloc[i].values
    for j in range(i+1,len(df)):
        row_j = df.iloc[j].values
        difr = row_i - row_j
        u = np.sum(difr**2)
        print(f'{i}, {j}: {row_i}, {row_j}, {difr}, {u}')

getting (with added title row):

i  j  row_i      row_j      difr         sum_of_squares
0, 1: [1.  3.5], [2.  3.6], [-1.  -0.1], 1.01
0, 2: [1.  3.5], [3.  4.7], [-2.  -1.2], 5.44
0, 3: [1.  3.5], [4.  6.8], [-3.  -3.3], 19.89
0, 4: [1.  3.5], [5.  7.3], [-4.  -3.8], 30.439999999999998
1, 2: [2.  3.6], [3.  4.7], [-1.  -1.1], 2.21
1, 3: [2.  3.6], [4.  6.8], [-2.  -3.2], 14.239999999999998
1, 4: [2.  3.6], [5.  7.3], [-3.  -3.7], 22.689999999999998
2, 3: [3.  4.7], [4.  6.8], [-1.  -2.1], 5.409999999999998
2, 4: [3.  4.7], [5.  7.3], [-2.  -2.6], 10.759999999999998
3, 4: [4.  6.8], [5.  7.3], [-1.  -0.5], 1.25

And now how to compute the same result (actually, only the last column of the above printout) much quicker. Run:

arr = df.values
difr = arr[:, np.newaxis] - arr[np.newaxis, :]
difr2 = difr * difr
ss = difr2.sum(axis=2)
result = ss[np.triu_indices(arr.shape[0], 1)]

The result is:

array([ 1.01,  5.44, 19.89, 30.44,  2.21, 14.24, 22.69,  5.41, 10.76, 1.25])

To fully comprehend how this code works, run is stepwise and print the result of each step.

And to use this code in your environment, i.e. take the source DataFrame without first 2 and last 2 columns, change the first line to:

arr = df.values[:, 2:-2]

And as far as execution speed is concerned: I compared execution times of your code (without printout and with added assembling of the sums of squares in a list) with mine code, using %timeit and I got:

  • 5.75 ms for your code,
  • 206 µs for my code.

Almost 28 times faster.

And for a bigger source data sample the difference should be yet greater.

Edit

Yet another remark: Your final goal is (probably) to inspect the resulting sums of squares (for each pair of source rows) and work out a decision which source rows do drop.

So maybe the source for this assessment should be ss array, not the flattened copy of its upper triangle.

Of course, you should still process the upper triangle of ss, but when you operate on ss you have indices of elements, which are equal to indices of both source rows.