I want to compare all lines of a csv file against each other. What I want to do is to get the sum of squared differences between each pair of rows:

`np.sum((row_i-row_j)**2)`

I did this using the Python csv library and it was fine, but I want to use pandas, and for some reason pandas is extremely slow.

My code using csv:

```python
import csv
import numpy as np

with open(filename) as file:
    reader = csv.reader(file, delimiter=' ', skipinitialspace=True)
    header = next(reader)
    table = list(reader)

for i in range(len(table) - 1):
    row_i = np.array(table[i][2:-2], dtype='double')
    for j in range(i + 1, len(table)):
        row_j = np.array(table[j][2:-2], dtype='double')
        u = np.sum((row_i - row_j) ** 2)
```

My code using pandas:

```python
import numpy as np
import pandas as pd

df = pd.read_csv(filename, delim_whitespace=True)
lines_to_remove2 = set()
for i in range(len(df) - 1):
    row_i = df.iloc[i][2:-2].values
    for j in range(i + 1, len(df)):
        row_j = df.iloc[j][2:-2].values
        u = np.sum((row_i - row_j) ** 2)
```

As you can see, I'm not doing any comparison yet; I'm only computing `u` for each pair of rows. I will later use the value of `u` to decide which lines to delete.
My file has 480 rows, which means there are 480*479/2 ≈ 115,000 pairs to process, which is not that many. When I iterate over a list, it takes less than 1 s to process all rows, but it takes many seconds to do the same with a pandas.DataFrame. Am I doing something wrong? Is there a faster way of doing this in pandas?

## Answer

Your code runs very slowly because you use two nested loops and access
each row of *df* by its index; row access with `.iloc` is expensive in pandas.

To keep the printouts to a reasonable size, I used a source DataFrame with only 5 rows:

```
     A    B
0  1.0  3.5
1  2.0  3.6
2  3.0  4.7
3  4.0  6.8
4  5.0  7.3
```
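If you want to recreate this frame yourself, it can be built, for instance, like this (a minimal sketch; the column names and values are just the ones shown above):

```python
import pandas as pd

# The 5-row example frame printed above
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'B': [3.5, 3.6, 4.7, 6.8, 7.3]})
```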

Another simplification is that I compute the result for **whole** rows,
instead of only the `[2:-2]` slice.

To get the result of your code, I ran:

```python
for i in range(len(df) - 1):
    row_i = df.iloc[i].values
    for j in range(i + 1, len(df)):
        row_j = df.iloc[j].values
        difr = row_i - row_j
        u = np.sum(difr ** 2)
        print(f'{i}, {j}: {row_i}, {row_j}, {difr}, {u}')
```

getting (with a title row added):

```
i  j  row_i      row_j      difr         sum_of_squares
0, 1: [1. 3.5],  [2. 3.6],  [-1. -0.1],  1.01
0, 2: [1. 3.5],  [3. 4.7],  [-2. -1.2],  5.44
0, 3: [1. 3.5],  [4. 6.8],  [-3. -3.3],  19.89
0, 4: [1. 3.5],  [5. 7.3],  [-4. -3.8],  30.439999999999998
1, 2: [2. 3.6],  [3. 4.7],  [-1. -1.1],  2.21
1, 3: [2. 3.6],  [4. 6.8],  [-2. -3.2],  14.239999999999998
1, 4: [2. 3.6],  [5. 7.3],  [-3. -3.7],  22.689999999999998
2, 3: [3. 4.7],  [4. 6.8],  [-1. -2.1],  5.409999999999998
2, 4: [3. 4.7],  [5. 7.3],  [-2. -2.6],  10.759999999999998
3, 4: [4. 6.8],  [5. 7.3],  [-1. -0.5],  1.25
```

And now, how to compute the same result (actually, only the last column of
the above printout) **much** faster. Run:

```python
arr = df.values
# Broadcast: (n, 1, k) minus (1, n, k) gives an (n, n, k) array
# holding the difference between every pair of rows
difr = arr[:, np.newaxis] - arr[np.newaxis, :]
difr2 = difr * difr
# Sum the squares over the column axis -> (n, n) matrix
ss = difr2.sum(axis=2)
# Keep only the upper triangle (i < j), i.e. each pair exactly once
result = ss[np.triu_indices(arr.shape[0], 1)]
```

The result is:

```
array([ 1.01,  5.44, 19.89, 30.44,  2.21, 14.24, 22.69,  5.41, 10.76,  1.25])
```

To fully comprehend how this code works, run it stepwise and print the result of each step.
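For instance, a minimal sketch of such a stepwise run (assuming the 5-row `df` built above), printing the shape of each intermediate array:

```python
import numpy as np

# Assumes the 5-row example df constructed above
arr = df.values
print(arr.shape)            # (5, 2)

# Step 1: broadcasting (5, 1, 2) - (1, 5, 2) -> all pairwise differences
difr = arr[:, np.newaxis] - arr[np.newaxis, :]
print(difr.shape)           # (5, 5, 2)

# Step 2: squared differences, same shape
difr2 = difr * difr

# Step 3: sum over the column axis -> (5, 5) matrix of sums of squares
ss = difr2.sum(axis=2)
print(ss.shape)             # (5, 5)
print(ss[0, 1], ss[3, 4])   # ~1.01, ~1.25 — matches the loop printout

# Step 4: flatten the upper triangle (i < j) into a 1-D result
result = ss[np.triu_indices(arr.shape[0], 1)]
print(result.shape)         # (10,)
```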

And to use this code in your environment, i.e. to take the source DataFrame without the first 2 and last 2 columns, change the first line to:

```python
arr = df.values[:, 2:-2]
```

And as far as execution speed is concerned: I compared the execution times
of your code (without the printout and with the sums of squares collected
in a list) with my code, using *%timeit*, and I got:

- 5.75 ms for your code,
- 206 µs for my code.

Almost 28 times faster.

And for a bigger source data sample the difference should be even greater.
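If you want to reproduce such a measurement outside IPython, here is a minimal sketch with the standard `timeit` module (assuming the 5-row `df` from above; `loop_version` and `vectorized_version` are just names chosen here, and the absolute numbers will vary by machine):

```python
import timeit
import numpy as np

def loop_version(df):
    # The original double loop over .iloc rows
    sums = []
    for i in range(len(df) - 1):
        row_i = df.iloc[i].values
        for j in range(i + 1, len(df)):
            row_j = df.iloc[j].values
            sums.append(np.sum((row_i - row_j) ** 2))
    return sums

def vectorized_version(df):
    # The broadcast-based computation shown above
    arr = df.values
    difr = arr[:, np.newaxis] - arr[np.newaxis, :]
    ss = (difr * difr).sum(axis=2)
    return ss[np.triu_indices(arr.shape[0], 1)]

print(timeit.timeit(lambda: loop_version(df), number=100))
print(timeit.timeit(lambda: vectorized_version(df), number=100))
```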

## Edit

Yet another remark: your final goal is (probably) to inspect the resulting sums of squares (one for each pair of source rows) and decide which source rows to drop.

So maybe the source for this assessment should be the *ss* array itself, not
the flattened copy of its upper triangle.

Of course, you should still process only the upper triangle of *ss*, but
when you operate on *ss* directly, the indices of each element are exactly
the indices of the two source rows involved.
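For example, here is a sketch of one possible rule; the threshold value and the "drop the later row of each too-similar pair" policy are my assumptions, not something from the question:

```python
import numpy as np

# Assumes the df from above; ss is recomputed here for completeness
arr = df.values
difr = arr[:, np.newaxis] - arr[np.newaxis, :]
ss = (difr * difr).sum(axis=2)

THRESHOLD = 2.0   # hypothetical cut-off for "rows are too similar"

# The upper-triangle indices enumerate all pairs (i, j) with i < j
i_idx, j_idx = np.triu_indices(ss.shape[0], 1)
too_close = ss[i_idx, j_idx] < THRESHOLD

# For each too-similar pair, keep row i and mark row j for deletion
rows_to_drop = sorted(set(j_idx[too_close]))
df_cleaned = df.drop(df.index[rows_to_drop])
print(df_cleaned)
```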