Pandas counting and summing specific conditions returns only NaN

I am trying to follow the otherwise excellent solution provided in the thread pandas-counting-and-summing-specific-conditions, but the code only ever outputs NaN values and, with sum (not count), raises a FutureWarning.

Basically, for each row in my df, I want to count how many dates in a column are within a range of +/- 1 days of other dates in the same column.

If I were doing it in Excel, I could use one of the following multi-conditional SUMPRODUCT or COUNTIFS formulas:

= SUMPRODUCT(--(AN2>=$AN$2:$AN$35000-1),--(AN2<=$AN$2:$AN$35000+1)),

or

=COUNTIFS($AN$2:$AN$35000,">="&AN2-1,$AN$2:$AN$35000,"<="&AN2+1)

In Python, following the approach in the linked thread, I believe the code would be:

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),
                             pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),
                             pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})

df["caseIntensity"] = df[(df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &
                             (df['datet'] >= df['datet'] - datetime.timedelta(days=1))].sum()

The output should be: 2, 2, 2, 3, 3, 2. Instead it is NaN across the board!

Is it correct to assume that because I'm testing conditions, it doesn't matter whether I sum or count? If I sum, I get a FutureWarning about invalid columns (the columns are valid), which I don't understand. But mostly, my question is: why am I only getting NaN?

Answer

Instead of looping with apply, you can use a vectorized solution: build arrays for the lower and upper bounds, broadcast the comparisons against the original dates with NumPy, chain them with &, and then count the True values with sum:

a = df['datet']
b = a + pd.Timedelta(days=1)
c = a - pd.Timedelta(days=1)

mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])

df["caseIntensity"] = mask.sum(axis=1)
print(df)
       datet  caseIntensity
0 2020-03-04              2
1 2020-03-05              2
2 2020-03-09              2
3 2020-03-10              3
4 2020-03-11              3
5 2020-03-12              2
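As for why the original code returns only NaN: the two row-wise conditions compare each date with itself shifted by a day, so they are always True, and `df[...].sum()` then collapses the whole frame into a Series indexed by column *names*. Assigning that Series to a new column aligns on the row index (0..5), which shares no labels with `['datet']`, so every row becomes NaN. A minimal demonstration (using `numeric_only=True` to reproduce the column-dropping that older pandas did with a FutureWarning):

```python
import pandas as pd
import datetime

df = pd.DataFrame({'datet': pd.to_datetime(
    ["2020-03-04", "2020-03-05", "2020-03-09",
     "2020-03-10", "2020-03-11", "2020-03-12"])})

# The comparisons are elementwise, so each row is only compared
# with itself -- both conditions are always True.
mask = ((df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &
        (df['datet'] >= df['datet'] - datetime.timedelta(days=1)))
print(mask.all())  # True

# df[mask] is therefore the whole frame; summing it column-wise
# yields a Series indexed by column names (here, empty once the
# non-numeric datetime column is dropped).
s = df[mask].sum(numeric_only=True)

# Assigning that Series to a column aligns on the row index,
# which matches none of its labels -> all NaN.
df["caseIntensity"] = s
print(df["caseIntensity"].isna().all())  # True
```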

Here is the performance for 6k rows:

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"), pd.to_datetime("2020-03-05 00:00:00"),
                         pd.to_datetime("2020-03-09 00:00:00"), pd.to_datetime("2020-03-10 00:00:00"),
                         pd.to_datetime("2020-03-11 00:00:00"), pd.to_datetime("2020-03-12 00:00:00")]})
df = pd.concat([df] * 1000, ignore_index=True)


In [140]: %%timeit
     ...: a = df['datet']
     ...: b = a + pd.Timedelta(days=1)
     ...: c = a - pd.Timedelta(days=1)
     ...:     
     ...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
     ...: 
     ...: df["caseIntensity"]  = mask.sum(axis=1)
     ...: 
469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [141]: %%timeit
     ...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
     ...: 
     ...: 
6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
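The `get_dates_in_range` helper timed above is not shown in the thread; a plausible sketch of such a loop-based baseline (my reconstruction, not the original code) would be:

```python
import pandas as pd

def get_dates_in_range(df, row):
    # Count rows whose date falls within +/- 1 day of this row's date.
    lo = row['datet'] - pd.Timedelta(days=1)
    hi = row['datet'] + pd.Timedelta(days=1)
    return ((df['datet'] >= lo) & (df['datet'] <= hi)).sum()

df = pd.DataFrame({'datet': pd.to_datetime(
    ["2020-03-04", "2020-03-05", "2020-03-09",
     "2020-03-10", "2020-03-11", "2020-03-12"])})

# Calls the helper once per row -- hence the ~13x slowdown above.
df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
print(df["caseIntensity1"].tolist())  # [2, 2, 2, 3, 3, 2]
```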