I am working with a pandas DataFrame that also contains NaN values. I want to replace the NaNs with interpolated values using `df.interpolate`, but only if the run of consecutive NaNs has length <= N. As an example, let's assume I choose N = 2 (so I want to fill runs of up to 2 NaNs) and I have this DataFrame:

    print(df)
      A   B   C
      1   1   1
    nan nan   2
    nan nan   3
    nan   4 nan
      5   5   5

In this case I want to apply a function to `df` so that only NaN runs of length <= 2 are filled, while longer runs are left untouched, giving my desired output:

    print(df)
      A   B   C
      1   1   1
    nan   2   2
    nan   3   3
    nan   4   4
      5   5   5

Note that I am aware of the `limit=N` option of `df.interpolate`, but it doesn't do what I want: it fills NaN runs of any length and merely stops after the first N NaNs of each run, resulting in this undesired output:

    print(df)
      A   B   C
      1   1   1
      2   2   2
      3   3   3
    nan   4   4
      5   5   5
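For reference, here is a minimal sketch that reproduces the `limit` behaviour described above. The frame is rebuilt from the printed example, so the default integer index and the float dtype are assumptions rather than part of the original data.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Example frame reconstructed from the printed table (default 0..4 index assumed).
df = pd.DataFrame({
    "A": [1, nan, nan, nan, 5],   # run of 3 NaNs: should stay untouched
    "B": [1, nan, nan, 4, 5],     # run of 2 NaNs: should be filled
    "C": [1, 2, 3, nan, 5],       # run of 1 NaN: should be filled
})

# limit=2 fills at most 2 consecutive NaNs per run, even when the run is longer.
limited = df.interpolate(limit=2)
print(limited)
```

Column `A` ends up partially filled here, which is exactly what I want to avoid.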

So do you know of a function, or how to write code, that produces my desired output? Thanks.

## Answer

You can perform run-length encoding and identify, for each column, the runs of `NaN` that are at most two elements long. One way to do that is to use `get_id` from the package `pdrle` (disclaimer: I wrote it).

    import pdrle

    chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
    df[chk] = df.interpolate()[chk]
    #      A    B    C
    # 0  1.0  1.0  1.0
    # 1  NaN  2.0  2.0
    # 2  NaN  3.0  3.0
    # 3  NaN  4.0  4.0
    # 4  5.0  5.0  5.0
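If you would rather not add a dependency, the same idea can be sketched with pandas alone. This is an alternative under stated assumptions, not how `pdrle` works internally: the helper name `interpolate_short_gaps` and its `max_gap` parameter are made up for illustration, and the run ids come from the common `ne(shift()).cumsum()` idiom.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(df, max_gap=2):
    """Interpolate only NaN runs of length <= max_gap (hypothetical helper)."""
    is_na = df.isna()
    # Per column, the run id increments whenever the NaN status changes.
    run_id = is_na.ne(is_na.shift()).cumsum()
    # Length of the run each cell belongs to, computed column by column.
    run_len = is_na.apply(
        lambda col: col.groupby(run_id[col.name]).transform("size")
    )
    # Cells that are NaN and sit in a sufficiently short run.
    fill = is_na & (run_len <= max_gap)
    # Keep original values except where a short NaN run should be filled.
    return df.where(~fill, df.interpolate())

nan = np.nan
# Example frame reconstructed from the question (default index assumed).
df = pd.DataFrame({
    "A": [1, nan, nan, nan, 5],
    "B": [1, nan, nan, 4, 5],
    "C": [1, 2, 3, nan, 5],
})
result = interpolate_short_gaps(df, max_gap=2)
print(result)
```

The interpolation is computed over the whole frame and then masked, so the filled values are the same ones `df.interpolate()` would produce. Short runs at the end of a column are filled the way `df.interpolate()` fills them by default; passing `limit_area="inside"` to it should avoid that if it is unwanted.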