Pandas compare value with previous row with filtration condition
The question is published on by Tutorial Guruji team.
I have a DataFrame with information about employee salaries. It has more than 900,000 rows.
Sample:
+----+-------------+---------------+----------+
|    | table_num   | name          |   salary |
|----+-------------+---------------+----------|
|  0 | 001234      | John Johnson  |     1200 |
|  1 | 001234      | John Johnson  |     1000 |
|  2 | 001235      | John Johnson  |     1000 |
|  3 | 001235      | John Johnson  |     1200 |
|  4 | 001235      | John Johnson  |     1000 |
|  5 | 001235      | Steve Stevens |     1000 |
|  6 | 001236      | Steve Stevens |     1200 |
|  7 | 001236      | Steve Stevens |     1200 |
|  8 | 001236      | Steve Stevens |     1200 |
+----+-------------+---------------+----------+
dtypes:
table_num: string
name: string
salary: float
I need to add a column indicating whether each employee's salary increased or decreased compared with the previous row.
I'm using the shift() function to compare values between rows.
The main problem is the filtering and the iteration over every unique employee across the whole dataset.
My script takes about three and a half hours.
How can I do it faster?
My script:
# giving us only unique combination of 'table_num' and 'name'
# since there can be same 'table_num' for different 'name'
# and same names with different 'table_num' appears sometimes
names_df = df[['table_num', 'name']].drop_duplicates()

# then extracting particular name and table_num from Series
for i in range(len(names_df)):  ### Bottleneck of whole script ###
    t = names_df.iloc[i,[0,1]][0]
    n = names_df.iloc[i,[0,1]][1]
    # using shift() and lambda to check if there difference between two rows
    diff_sal = (
        df[(df['table_num']==t) & ((df['name']==n))]['salary']
        - df[(df['table_num']==t) & ((df['name']==n))]['salary'].shift(1)
    ).apply(lambda x: 1 if x>0 else (-1 if x<0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values
Sample input data:
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson', 'John Johnson', 'John Johnson', 'John Johnson', 'John Johnson',
             'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})
Sample output:
+----+-------------+---------------+----------+-------+
|    | table_num   | name          |   salary |   inc |
|----+-------------+---------------+----------+-------|
|  0 | 001234      | John Johnson  |     1200 |     0 |
|  1 | 001234      | John Johnson  |     1000 |    -1 |
|  2 | 001235      | John Johnson  |     1000 |     0 |
|  3 | 001235      | John Johnson  |     1200 |     1 |
|  4 | 001235      | John Johnson  |     1000 |    -1 |
|  5 | 001235      | Steve Stevens |     1000 |     0 |
|  6 | 001236      | Steve Stevens |     1200 |     0 |
|  7 | 001236      | Steve Stevens |     1200 |     0 |
|  8 | 001236      | Steve Stevens |     1200 |     0 |
+----+-------------+---------------+----------+-------+
Answer
Use groupby together with diff:
df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
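As a possible variation on the groupby and diff approach, the two df.loc assignments can be collapsed into a single call to numpy.sign. This is a minimal sketch rather than part of the original answer: it assumes NumPy is available, rebuilds the sample data from the question, and uses the illustrative variable name delta.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

# Difference from the previous row within each (table_num, name) group.
# The first row of every group has no predecessor, so it comes back as NaN.
delta = df.groupby(['table_num', 'name'])['salary'].diff()

# np.sign maps positive values to 1, negative values to -1 and zero to 0.
# NaN would stay NaN, so it is replaced with 0 before taking the sign.
df['inc'] = np.sign(delta.fillna(0.0)).astype(int)

print(df['inc'].tolist())
# [0, -1, 0, 1, -1, 0, 0, 0, 0]
```

The grouping runs once over the whole DataFrame, so no Python-level loop over employees is needed. If a float column is preferred, as in the answer, the final astype(int) can be dropped.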