Missing values NaN replace conditional Python

I have a dataset containing the values of several indicators of a series of countries and years (3072,1134 rows and columns), but some NaN.

Find below a sample of the dataset:

Year Country Submitted Age12 Age14
2017 CHI 1 267 136
2018 CHI 1 267 NaN
2019 CHI NaN NaN NaN
2020 CHI 1 244 203
2017 ALB 1 160 148
2018 ALB 1 163 165
2019 ALB 1 NaN NaN
2020 ALB 1 161 NaN
2017 GER 1 NaN 320
2018 GER 1 451 381
2019 GER NaN NaN NaN
2020 GER 1 361 321
2017 BRI 1 NaN NaN
2018 BRI 1 NaN NaN
2019 BRI NaN NaN NaN
2020 BRI 1 322 298

I would like to replace missing values based on the values of the column Submitted.

I would like to change the values by the previous row only if the respective Submitted value is “NaN”.

And this is what I would like to have:

Year Country Submitted Age12 Age14
2017 CHI 1 267 136
2018 CHI 1 267 NaN
2019 CHI 1 267 NaN
2020 CHI 1 244 203
2017 ALB 1 160 148
2018 ALB 1 163 165
2019 ALB 1 NaN NaN
2020 ALB 1 161 NaN
2017 GER 1 NaN 320
2018 GER 1 451 381
2019 GER 1 451 381
2020 GER 1 361 321
2017 BRI 1 NaN NaN
2018 BRI 1 NaN NaN
2019 BRI 1 NaN NaN
2020 BRI 1 322 298

With some help from the community, I tried using

df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill'))

and

df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill',limit=2))

and

df = df.where(~df.Sumbitted.isnull(), interpolate(method="pad", limit=2))

However, using these options I get the following instead. The issue is that the code retrieves the previous non-NaN value, but in the cases that the previous value is a NaN the code retrieves the value before the previous until its finds a non-NaN value.

Output:

Year Country Submitted Age12 Age14
2017 CHI 1 267 136
2018 CHI 1 267 NaN
2019 CHI 1 267 136
2020 CHI 1 244 203
2017 ALB 1 160 148
2018 ALB 1 163 165
2019 ALB 1 NaN NaN
2020 ALB 1 161 NaN
2017 GER 1 NaN 320
2018 GER 1 451 381
2019 GER 1 451 381
2020 GER 1 361 321
2017 BRI 1 NaN NaN
2018 BRI 1 NaN NaN
2019 BRI 1 361 321
2020 BRI 1 322 298

See the wrong outputs of CHI 2019 Age14 and BRI Age12 and Age14

Would you know how to solve this issue by fixing this code? Solutions using a loop or selecting these 2 columns (Age14 and Age12) are not optimal as I have several columns (“…” in the example) and need a systematic solution.

Thank you

Answer

You can use the parameter limit=1 to limit filling only from the previous one row value and not from further up:

df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill',limit=1))

Or better still, simplify your codes, as follows:

Use .mask() instead of .where so that your code don’t need extra step to negate the boolean mask. Also, use ffill() which is a simplified form of .fillna(axis=0, method='ffill'):

df = df.mask(df.Submitted.isna(), df.ffill(limit=1))

Result:

print(df)

    Year Country  Submitted  Age12  Age14
0   2017     CHI        1.0  267.0  136.0
1   2018     CHI        1.0  267.0    NaN
2   2019     CHI        1.0  267.0    NaN
3   2020     CHI        1.0  244.0  203.0
4   2017     ALB        1.0  160.0  148.0
5   2018     ALB        1.0  163.0  165.0
6   2019     ALB        1.0    NaN    NaN
7   2020     ALB        1.0  161.0    NaN
8   2017     GER        1.0    NaN  320.0
9   2018     GER        1.0  451.0  381.0
10  2019     GER        1.0  451.0  381.0
11  2020     GER        1.0  361.0  321.0
12  2017     BRI        1.0    NaN    NaN
13  2018     BRI        1.0    NaN    NaN
14  2019     BRI        1.0    NaN    NaN
15  2020     BRI        1.0  322.0  298.0