# drop_duplicates in a range

I have a dataframe in Python like this:

```
    st      se      st_min  st_max  se_min  se_max
42  922444  923190  922434  922454  923180  923200
24  922445  923190  922435  922455  923180  923200
43  928718  929456  928708  928728  929446  929466
37  928718  929459  928708  928728  929449  929469
```

As we can see, the first two columns define a range, and the remaining columns allow a variation of 10 positions around each end of that range.

I know that the function `drop_duplicates` can remove duplicate rows based on an exact match of values.

But what if I want to remove rows based on a range of values? For example, indexes 42 and 24 fall in the same range (if I consider a range of 10), and indexes 43 and 37 are in the same situation.

How can I do this?

PS: I can't deduplicate based on only one column (e.g. `st` or `se`); I need to remove redundancy based on both columns (`st` and `se`), using the min and max columns as filters…
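For reference, a minimal sketch that rebuilds just the `st` and `se` columns of the table above shows that exact matching is not enough here:

```python
import pandas as pd

# Sample frame from the question (st/se columns only, for brevity)
df = pd.DataFrame(
    {"st": [922444, 922445, 928718, 928718],
     "se": [923190, 923190, 929456, 929459]},
    index=[42, 24, 43, 37],
)

# Exact matching removes nothing: no two (st, se) pairs are identical,
# even though rows 42/24 and 43/37 lie within 10 positions of each other.
deduped = df.drop_duplicates(subset=["st", "se"])
print(len(deduped))  # 4 -- all rows survive
```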

I assume you want to combine all ranges, so that all overlapping ranges are reduced to one row. You need to do that iteratively, because more than two ranges could chain together into one big range, not just two. You could do it like this (just replace `df` with the variable you use to store your dataframe):

```
import pandas as pd

# create a dummy key column to produce a cartesian product
df['fake_key'] = 0
right_df = df.copy()
right_df.rename({col: col + '_r' for col in right_df if col != 'fake_key'},
                axis='columns', inplace=True)

# this flag indicates that we need to perform the loop once more
change = True
# diff and new_diff are compared to see if a loop iteration changed anything
new_diff = (right_df['se_r'] - right_df['st_r']).sum()
while change:
    diff = new_diff
    joined_df = df.merge(right_df, on='fake_key')
    # drop pairs where the left range ends before the right range starts
    invalid_indexer = joined_df['se'] < joined_df['st_r']
    joined_df.drop(joined_df[invalid_indexer].index, axis='index', inplace=True)
    right_df = joined_df.groupby('st').aggregate(
        {col: 'max' if '_min' not in col else 'min' for col in joined_df})
    # update the ..._min / ..._max fields in the combined range
    for col in ['st_min', 'se_min', 'st_max', 'se_max']:
        col_r = col + '_r'
        col1, col2 = (col, col_r) if 'min' in col else (col_r, col)
        right_df[col_r] = right_df[col1].where(right_df[col1] <= right_df[col2],
                                               right_df[col2])
    right_df.drop(['se', 'st_r', 'st_min', 'se_min', 'st_max', 'se_max'],
                  axis='columns', inplace=True)
    right_df.rename({'st': 'st_r'}, axis='columns', inplace=True)
    right_df['fake_key'] = 0
    # now check whether we need to iterate once more
    new_diff = (right_df['se_r'] - right_df['st_r']).sum()
    # stop as soon as a full iteration no longer changes the total
    change = diff != new_diff

# now all ranges which overlap have the same value for se_r,
# so we just need to aggregate on se_r to collapse them
result = right_df.groupby('se_r').aggregate(
    {col: 'min' if '_max' not in col else 'max' for col in right_df})
result.rename({col: col[:-2] if col.endswith('_r') else col for col in result},
              axis='columns', inplace=True)
result.drop('fake_key', axis='columns', inplace=True)
```

If you execute this on your data, you get:

```
            st      se  st_min  st_max  se_min  se_max
se_r
923190  922444  923190  922434  922455  923180  923200
929459  928718  929459  922434  928728  923180  929469
```

Note: if your dataset is larger than a few thousand records, you might need to change the join logic above, since it produces a cartesian product. In the first iteration you get a `joined_df` of size n², where n is the number of records in your input dataframe; in each later iteration `joined_df` gets smaller due to the aggregation.
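To make the blow-up concrete, here is a tiny sketch (the frame and column names are just illustrative):

```python
import pandas as pd

# With a constant fake_key, merge pairs every left row with every right
# row, so n rows on each side blow up to n * n rows in the join.
n = 1000
left = pd.DataFrame({'st': range(n), 'fake_key': 0})
right = pd.DataFrame({'st_r': range(n), 'fake_key': 0})
cart = left.merge(right, on='fake_key')
print(len(cart))  # 1000000 rows from 1000 + 1000 input rows
```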

I just ignored that, because I don't know how large your dataset is. Avoiding this would make the code a bit more complex. But if you need it, you could create an auxiliary dataframe that allows you to "bin" the `se` values on both dataframes and use the binned value as the `fake_key`. It's not quite regular binning: you would have to create a dataframe that contains, for each `fake_key`, all values in the range (0…`fake_key`). So e.g. if you define your fake key to be `fake_key=se//1000`, your dataframe would contain

```
fake_key  fake_key_join
922       922
922       921
922       920
...       ...
922       0
```

If you replace the `merge` in the loop above with code that merges such a dataframe on `fake_key` with `right_df`, and then merges the result on `fake_key_join` with `df`, you can reuse the rest of the code and get the same result as above, without having to produce a full cartesian product.
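A sketch of one way this could look. Note the assumptions: the bin width `BIN`, the helper names `key_df` and `fake_key_join`, and the toy stand-in frames are all illustrative, and this variant bins `st_r` on the right-hand side (the column that the `se < st_r` filter actually compares against `se`), so the binned join can only discard pairs the filter would have discarded anyway:

```python
import pandas as pd

BIN = 1000  # illustrative bin width; choose it to fit your data

# Toy stand-ins for df and right_df as they look inside the loop
df = pd.DataFrame({'st': [922444, 928718], 'se': [923190, 929459]})
right_df = pd.DataFrame({'st_r': [922445, 928718], 'se_r': [923190, 929456]})

# Bin the two columns that the se < st_r filter compares
df['fake_key'] = df['se'] // BIN
right_df['fake_key'] = right_df['st_r'] // BIN

# For every right-hand bin k, list all bins j >= k, so a df row in bin j
# still meets every right_df row whose st_r bin does not exceed j.
max_key = df['fake_key'].max()
key_df = pd.DataFrame(
    [(k, j) for k in range(max_key + 1) for j in range(k, max_key + 1)],
    columns=['fake_key', 'fake_key_join'],
)

# Two-step merge instead of the full cartesian product
joined_df = (right_df
             .merge(key_df, on='fake_key')
             .drop(columns='fake_key')
             .merge(df.rename(columns={'fake_key': 'fake_key_join'}),
                    on='fake_key_join'))

# All pairs with se >= st_r survive; in this toy data the one invalid
# pairing (se=923190 against st_r=928718) never even materializes.
# In general you still apply the se < st_r filter afterwards.
print(len(joined_df))  # 3 pairs instead of the 4 of a cartesian product
```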