Numpy Select Default Condition Returns Wrong Value

I have the following code:

datetime_const = datetime(2021, 3, 31)
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime1'], format='%Y-%m-%d')
tmp_df1['test_col_1'] = (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12)))
tmp_df1['test_col_2'] = (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
tmp_df1['test_col_3'] = datetime_const + pd.DateOffset(months=12)
tmp_df1['test_col_4'] = datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
tmp_df1['test_col_5'] = tmp_df1['datetime2']
tmp_df1['datetime3'] = np.select(
    [
        (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
        (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
    ],
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    default=tmp_df1['datetime2']
)

datetime1 has object dtype, so I converted it to datetime64 and stored the result as datetime2.

value1 is a float column of decimal numbers; it does contain NaNs.

I created test_col_1 to test_col_5 to check the individual conditions and choices within my np.select call. They all look correct when assigned as individual df columns.

However, the datetime3 column assigned from np.select contains large integers with object dtype, such as 1628121600000000000. I would expect it to contain either a datetime64 value from one of the two choices, or the default datetime2 column value.

Please see the sample .info and df rows below:

Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   datetime2                   26558 non-null  datetime64[ns]
 1   value1                      25438 non-null  float64       
 2   test_col_1                  26558 non-null  bool          
 3   test_col_2                  26558 non-null  bool          
 4   test_col_3                  26558 non-null  datetime64[ns]
 5   test_col_4                  25438 non-null  datetime64[ns]
 6   test_col_5                  26558 non-null  datetime64[ns]
 7   datetime3                   26558 non-null  object        
dtypes: bool(2), datetime64[ns](4), float64(1), object(1)
memory usage: 1.5+ MB

            datetime2   value1  test_col_1  test_col_2 test_col_3 test_col_4 test_col_5        datetime3
0           2021-06-30 0.00058       False        True 2022-03-31 2021-08-05 2021-06-30        1628121600000000000
1           2022-03-31 0.00044       False       False 2022-03-31 2021-09-13 2022-03-31        1648684800000000000
2           2024-06-07 0.00860       False       False 2022-03-31 2021-04-08 2024-06-07        1717718400000000000
3           2021-09-30 0.00867       False       False 2022-03-31 2021-04-08 2021-09-30        1632960000000000000
4           2021-08-31 0.00144       False       False 2022-03-31 2021-05-21 2021-08-31        1630368000000000000
5           2021-08-31 0.00144       False       False 2022-03-31 2021-05-21 2021-08-31        1630368000000000000
6           2021-04-08 0.00474       False        True 2022-03-31 2021-04-15 2021-04-08        1618444800000000000
7           2023-10-01 0.11506       False       False 2022-03-31 2021-04-01 2023-10-01        1696118400000000000
8           2023-09-29 0.12067       False       False 2022-03-31 2021-04-01 2023-09-29        1695945600000000000
9           2021-05-31 0.02508       False       False 2022-03-31 2021-04-03 2021-05-31        1622419200000000000

I am completely baffled by this behavior, please enlighten me!

Thank you all in advance!

Answer

It looks like np.select converts the dates to their integer representation (nanoseconds since the epoch, held in an object array). An easy fix is to convert the result back afterwards with astype:

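from datetime import datetime

import numpy as np
import pandas as pd

datetime_const = datetime(2021, 3, 31)

# dummy data
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
                       columns= ['datetime2','value1'])
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')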


tmp_df1['datetime3'] = np.select(
    [
        (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
        (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
    ],
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    default=tmp_df1['datetime2']
).astype('datetime64[ns]') ### <--- add this

print(tmp_df1)
   datetime2   value1  datetime3
0 2021-06-30  0.00058 2021-08-04
1 2023-10-01  0.11506 2023-10-01
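
To see where the integers come from, here is a minimal sketch using plain NumPy (the names stamps, as_objects and restored are only illustrative). Casting a datetime64[ns] array to object gives Python integers counting nanoseconds since the epoch, which is what appears to happen inside np.select once its result dtype falls back to object. The integers can be read back with pd.to_datetime.

```python
import numpy as np
import pandas as pd

# Two dates taken from the question's sample rows, stored with nanosecond precision
stamps = np.array(['2021-08-05', '2023-10-01'], dtype='datetime64[ns]')

# Casting to object cannot keep nanosecond datetimes, so NumPy hands back plain ints
as_objects = stamps.astype(object)
print(as_objects)

# The ints are nanoseconds since the epoch, so they convert back cleanly
restored = pd.to_datetime(as_objects.astype('int64'))
print(restored)
```

The first printed value is the same 1628121600000000000 that shows up in row 0 of the question's datetime3 column.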

Longer explanation

I think the problem is in your two choices: the first one is a single value and the second one is a Series. You can see that it works when the first choice is a Series too (with datetime dtype).

# dummy
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
                       columns= ['datetime2','value1'])
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')

If I use your method, I get the long integer representation, as you did:

np.select(
    ...
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],...
)
# gives
array([1628035200000000000, 1696118400000000000], dtype=object)

But if I replace datetime_const in the first choice with a Series (this does not match your use case, it only illustrates the point):

np.select(
    ...
    [
        tmp_df1['datetime2'] + pd.DateOffset(months=12), # here the constant is replaced by the column datetime2, for example
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    ...
)
# gives the expected datetime dtype (the first choice is no longer the value you want, of course)
array(['2021-08-04T00:00:00.000000000', '2023-10-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
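
Following the same idea, here is a sketch that keeps your constant but avoids the object fallback. It assumes that np.select preserves datetime64 when every choice and the default already have that dtype, as the output above suggests. The scalar choice is converted to a NumPy value with Timestamp.to_datetime64(). The names df, one_year_later, elapsed_years, scaled_date and result are illustrative, and the last two rows are made-up data that exercise the first condition and a NaN.

```python
import numpy as np
import pandas as pd

datetime_const = pd.Timestamp('2021-03-31')

# First two rows come from the dummy frame above; the last two are invented for illustration
df = pd.DataFrame({
    'datetime2': pd.to_datetime(['2021-06-30', '2023-10-01', '2021-12-31', '2021-05-31']),
    'value1': [0.00058, 0.11506, 0.0001, np.nan],
})

one_year_later = datetime_const + pd.DateOffset(months=12)
elapsed_years = (df['datetime2'] - datetime_const).dt.days / 365
scaled_date = datetime_const + pd.to_timedelta((0.0002 / df['value1'] * 365).round(), unit='D')

result = np.select(
    [
        (df['value1'] < 0.0002) & (df['datetime2'] < one_year_later),
        (df['value1'] >= 0.0002) & (elapsed_years * df['value1'] < 0.0002),
    ],
    [
        one_year_later.to_datetime64(),  # NumPy datetime64 scalar instead of a pandas Timestamp
        scaled_date,                     # datetime64 Series, NaT where value1 is NaN
    ],
    default=df['datetime2'],
)

df['datetime3'] = result
print(df)
```

With every choice already in datetime64, no astype call is needed afterwards, and the NaN row simply falls through to the default because both conditions evaluate to False there.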