I’m kind of new in python and I’ve been working on migrating my excel to pandas because it cannot run a hundred of thousands of rows. I have a table that looks like this in excel: Excel table
Where column A and B are inputs and column C is output.
The formula for column C is
If ‘Status’ is equal to “new” the result will be the value in column A
And if ‘Status’ is not equal to “new” the result will be the previous row of C
I tried doing it using
.shift(-1) using this code
df['Previous'] = np.where (df['Status']=='new', df['Count'], df['Previous'].shift(-1))
but it seems I am receiving this error Key error: ‘Previous’ It seems that I need to define the column ‘Previous’ first.
I tried searching the stackoverflow but most of the time the related solutions are somewhat based in complex problems and I was not able to pattern it to my simple problem.
Thanks a lot for your help.
This is the df.columns looks like Index([‘Count’, ‘Status’], dtype=’object’)
This is the result of my code once run. Result
Since you are creating new column
Previous and this column is still not yet defined when you use it in the definition of itself, in the
np.where() statement, you will get an error.
Also, your question is actually not taking a “previous” value since when you are handling the first row, there is no previous value for the first row and even when processing 2nd and 3rd rows the value is still not defined until we go to the 4th row.
So, the solution need to set a kind of temporary non-deterministic value while processing rows still with unknown value and set these non-deterministic values afterwards when some values are defined. In this case, we can set those temporary non-deterministic values as
np.nan and then back-fill using
.bfill() with defined values afterwards. We use backward fill because we are filling values of rows with index 0, 1, 2 by value on row with index 3.
To solve it, you can try the following:
df['Previous'] = np.where(df['Status']=='new', df['Count'], np.nan) df['Previous'] = df['Previous'].bfill().astype(int) print(df) Count Status Previous 0 4 old 1 1 3 old 1 2 2 old 1 3 1 new 1 4 40 old 10 5 30 old 10 6 20 old 10 7 10 new 10 8 400 old 100 9 300 old 100 10 200 old 100 11 100 new 100
Here, I assumed the dtype of column
Count is integer. If it is of string type, then you don’t need to use the
.astype(int) in the code above.
Alternatively, you can also do it in one step using
.where() on column
Count, instead of
np.where() as follows:
df['Previous'] = df['Count'].where(df['Status'] =='new').bfill().astype(int) print(df) Count Status Previous 0 4 old 1 1 3 old 1 2 2 old 1 3 1 new 1 4 40 old 10 5 30 old 10 6 20 old 10 7 10 new 10 8 400 old 100 9 300 old 100 10 200 old 100 11 100 new 100
Similarly, no need to use
.astype(int) in the code above if column
Count is of string type.
.where() is to: “Replace values where the condition is False”. This is some how equivalent to “Retain values where the condition is True”. So when the condition is True, we use the values of original
Count column. Then, you probably would ask: “What if the condition False and what value to replace?” The answer can be seen from the
official document and can be found from the 2nd parameter showing
other=nan. When the condition is False, the value specified in the 2nd parameter
other (if any) will be used. If the 2nd parameter is not specified, it defaults to
nan. Hence, in our case, we don’t specify the 2nd parameter for when the condition is False,
nan will be used for the values. Therefore, same effect as we specify
np.nan for the False condition in the