Filter a pandas DataFrame based on the values of 2 consecutive rows

I have a pandas DataFrame and I want to extract pairs of consecutive rows where:

  • the two values of a given column match 2 given values (in any order)
  • the value of another column is the same in both rows
  • the two dates are 1 day apart

To give a concrete example, let’s say I have:

from datetime import datetime
import pandas as pd

df = pd.DataFrame([
    [datetime(2021, 1, 1), "Pizza", 50, "Some Place"],
    [datetime(2021, 1, 2), "Noddles", 36, "Some Place"],
    [datetime(2021, 1, 3), "Rice", 10, "Asian Delice"],
    [datetime(2021, 1, 4), "Pizza", 36, "Some Place"],
    [datetime(2021, 1, 5), "Steak", 75, "Steak House"],
    [datetime(2021, 1, 6), "Pizza", 52, "Another Place"],
    [datetime(2021, 1, 6), "Rice", 10, "Asian Delice"],
    [datetime(2021, 1, 7), "Noddles", 42, "Another Place"],
    [datetime(2021, 1, 8), "Dumplings", 12, "Asian Delice"],
    [datetime(2021, 1, 9), "Noddles", 39, "Some Place"],
    [datetime(2021, 1, 10), "Pizza", 53, "Some Place"],
    [datetime(2021, 1, 13), "Noddles", 0, "Some Place"],
    [datetime(2021, 1, 14), "Pizza", 0, "Another Place"],
], columns=["Date", "Food", "Cost", "Restaurant"])

I want to extract the rows where, on 2 consecutive days, we have Pizza and Noddles in the same restaurant, so the result would be:

    Date        Food     Cost  Restaurant
0   2021-01-01  Pizza    50    Some Place
1   2021-01-02  Noddles  36    Some Place
5   2021-01-06  Pizza    52    Another Place
7   2021-01-07  Noddles  42    Another Place
9   2021-01-09  Noddles  39    Some Place
10  2021-01-10  Pizza    53    Some Place

How could I achieve that with pandas?

Answer

Inspired by @BENY (thanks, Beny), I came up with this solution. It does not seem ideal, but at least it works.

  1. Filter the DataFrame to keep only the Pizza and Noddles rows
  2. Create a new column holding an ID for each restaurant (so that a diff can check it is the same)
  3. Diff the rows on date and restaurant ID to obtain a mask (note: we need to diff in both directions because both rows of a pair must be matched)
  4. Apply the mask to retrieve the expected DataFrame

Any suggestions for improvement, or an alternative solution that is more “pandas-ic”, are welcome 😉

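# Keep only the two foods; copy so that adding a column does not raise SettingWithCopyWarning
df = df[df.Food.isin(['Pizza','Noddles'])].copy()
# Integer ID per restaurant, so that diff() == 0 means "same restaurant"
df["RestoID"] = pd.factorize(df.Restaurant)[0]
# Previous row is exactly 1 day earlier and in the same restaurant
mask = df.Date.diff().dt.days.eq(1) & df.RestoID.diff().eq(0)
# Next row is exactly 1 day later and in the same restaurant
mask |= df.Date.diff(-1).dt.days.eq(-1) & df.RestoID.diff(-1).eq(0)
# Note: this version does not check that the two foods differ
df[mask].drop("RestoID", axis=1)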

And the result is

    Date        Food     Cost  Restaurant
0   2021-01-01  Pizza    50    Some Place
1   2021-01-02  Noddles  36    Some Place
5   2021-01-06  Pizza    52    Another Place
7   2021-01-07  Noddles  42    Another Place
9   2021-01-09  Noddles  39    Some Place
10  2021-01-10  Pizza    53    Some Place
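
To see why step 3 diffs in both directions, here is a minimal sketch on a made-up three-row Series (the dates are illustrative, not taken from the data above). `diff()` compares each row with the previous one, so on its own it can only flag the second row of a pair; `diff(-1)` compares with the next row and flags the first.

```python
import pandas as pd

# Hypothetical dates: the first two are 1 day apart, the third is isolated
dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]))

backward = dates.diff().dt.days    # row minus previous row: NaN, 1, 3
forward = dates.diff(-1).dt.days   # row minus next row: -1, -3, NaN

second_of_pair = backward.eq(1)    # True only for the second row of the pair
first_of_pair = forward.eq(-1)     # True only for the first row of the pair
both = first_of_pair | second_of_pair

print(both.tolist())  # [True, True, False]
```

Combining the two masks with `|` is what keeps both rows of each pair.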

A more elegant solution is to compare each row with its shifted neighbours, something like this:

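df = df.loc[df.Food.isin(['Pizza','Noddles'])]
mask = False
for i in [-1, 1]:
    # i = 1 compares with the previous row, i = -1 with the next one;
    # diff(i) is exactly i days when the two dates are 1 day apart
    mask |= df.Date.diff(i).dt.days.eq(i) & df.Food.ne(df.Food.shift(i)) & df.Restaurant.eq(df.Restaurant.shift(i))
df[mask]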
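
The shift-based filter compares neighbouring rows of the whole filtered frame, so it relies on the two rows of a pair ending up adjacent. A possible variant, sketched here and not part of the original answer, shifts within each restaurant via `groupby`, so that visits to other restaurants in between do not hide a pair. The function name `consecutive_pairs`, its parameters, and the small sample frame are made up for illustration; the column names are assumed to be those of the question.

```python
import pandas as pd

def consecutive_pairs(df, foods=("Pizza", "Noddles")):
    """Hypothetical helper: rows that form a pair of different foods,
    1 day apart, in the same restaurant (columns Date, Food, Restaurant)."""
    sub = df[df["Food"].isin(foods)].sort_values("Date", kind="stable")
    by_resto = sub.groupby("Restaurant")
    mask = pd.Series(False, index=sub.index)
    for i in (-1, 1):
        # shift(i) within each restaurant: i = 1 is the previous visit, i = -1 the next
        gap = (sub["Date"] - by_resto["Date"].shift(i)).dt.days
        other_food = by_resto["Food"].shift(i)
        # gap is NaN (so eq is False) when there is no neighbouring visit
        mask |= gap.eq(i) & sub["Food"].ne(other_food)
    return sub[mask].sort_index()

# Illustrative data: the two rows of restaurant "A" are not adjacent
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02",
                            "2021-01-04", "2021-01-05"]),
    "Food": ["Pizza", "Pizza", "Noddles", "Noddles", "Rice"],
    "Restaurant": ["A", "B", "A", "B", "A"],
})
result = consecutive_pairs(sample)
print(result.index.tolist())  # [0, 2]
```

In this sample, rows 0 and 2 are separated by a visit to restaurant B, so a plain row-to-row comparison would miss them, while the per-restaurant shift still pairs them up.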
