Filter df containing specific values within sequence – pandas

I’ve got a somewhat convoluted process of subsetting a df. For a given sequence or rows, I want to return specific values. Specifically, using the df below, the beginning of a sequence is signified by Start_A, Start_B, Start_C. If X or Y is located within a given sequence, I want to return either Up or Down for the same sequence. If Up or Down is not found within the sequence containing X or Y, then return Left or Right. If neither Up or Down nor X or Y is found print an error.

import pandas as pd

df = pd.DataFrame({      
    'Num' : [1,2,3,4,6,7,9,10,12,13,14,15,17,18,19,21,22,23,25,26,27,28,30,31,32],
    'Item' : ['Start_A','AB','CD','Left','Start_C','CD','X','Up','Right','Start_C','EF','AB','Y','AB','Down','Left','Start_B','AB','Y','CD','Left','Start_A','AB','CD','Right'],        
    })
  
m1 = df['Item'].isin(['X','Y']).cumsum().gt(0)
m2 = df['Item'].isin(['Up','Down']).iloc[::-1].cumsum().gt(0)

df1 = df[m1 & m2]

original df:

    Num     Item
0     1  Start_A # No X,Y within sequence. drop all
1     2       AB
2     3       CD
3     4     Left
4     6  Start_C # X and Up within sequence.
5     7       CD
6     9        X
7    10       Up
8    12    Right
9    13  Start_C # Y and Down within sequence.
10   14       EF
11   15       AB
12   17        Y
13   18       AB
14   19     Down
15   21     Left
16   22  Start_B # Y within sequence. No Up/Down. But Left is.
17   23       AB
18   25        Y
19   26       CD
20   27       AB
21   27     Left
22   28  Start_A # No X,Y within sequence. drop all
23   30       AB
24   31       CD
25   32    Right

intended output:

    Num     Item
6     9        X
7    10       Up
12   17        Y
14   19     Down
18   25        Y
21   27     Left

Answer

Here’s a way of doing it:

import pandas as pd
import numpy as np

df = pd.DataFrame({      
    'Num' : [1,2,3,4,6,7,9,10,12,13,14,15,17,18,19,21,22,23,25,26,27,28,30,31,32],
    'Item' : ['Start_A','AB','CD','Left','Start_C','CD','X','Up','Right','Start_C','EF','AB','Y','AB','Down','Left','Start_B','AB','Y','CD','Left','Start_A','AB','CD','Right'],        
    })
  

grp = df['Item'].str.startswith('Start_').cumsum()

df['X_Y'] = df['Item'].isin(['X', 'Y'])
df['Up_Down'] = df['Item'].isin(['Up', 'Down'])
df['Left_Right'] = df['Item'].isin(['Left', 'right'])

def f(x):
    if x['X_Y'].any():
        return pd.concat([x[x['X_Y']], x[x['Up_Down']], x[x['Left_Right']]]).head(2)

df.groupby(grp, group_keys=False).apply(f).drop(['X_Y', 'Up_Down', 'Left_Right'], axis=1)

Output:

    Num  Item
6     9     X
7    10    Up
12   17     Y
14   19  Down
18   25     Y
20   27  Left

Details:

  • First, create groups, grp, using cumsum and startswith ‘Start_’.
  • Next, create three boolean series marking ‘X Y’, ‘Up Down’, and ‘Left Right’
  • Then, create a custom function that takes each group, if that group contains a True record for ‘X_Y’, then build a dataframe concatenating ‘X_Y’, ‘Up_Down’, and ‘Left_Right’ in that order. Use head(2) only get the first two records in each group.
  • Drop helper columns after building the resulting dataframe from groupby.