Count duplicate (but unique records) + consecutive records

I’m looking to count the number of invoices that are consecutive + duplicate dates (but unique invoices). I tried converting datetime to string then counting sequences but it only counts unique dates per invoice.

# fix time format
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'])

# column to find sequences
df = df.sort_values(['Vendor ID', 'Vendor Name', 'Invoice Date'])
df['continue'] = df['Invoice Date'].diff().dt.days.ne(1).cumsum()

# find time sequences
val_count = df['continue'].value_counts()
val_count = val_count[val_count.gt(1)].index.tolist()


#filter and group by Vendor
output = df[df['continue'].isin(val_count)].groupby(['Vendor ID'])['Vendor Name'].count()

Current Input:

Vendor ID Vendor Name Invoice Date Invoice Number
8555 Macys 1/1/2020 1
8555 Macys 1/2/2020 2
8555 Macys 1/3/2020 3
8555 Macys 1/3/2020 4
8555 Macys 1/3/2020 5
8555 Macys 1/5/2020 6
1044 JC Penny 2/2/2020 7
1044 JC Penny 2/3/2020 8
1044 JC Penny 2/3/2020 9
1044 JC Penny 2/8/2020 10
1044 JC Penny 2/8/2020 11

Expected Output:

Vendor ID Vendor Name Count of Consecutive Invoices
8555 Macys 5
1044 JC Penny 5

**August 24:**I’M EDITING MY INPUT AND EXPECTED OUTPUT. The code everyone’s suggesting was working on the previous example I gave but not on my actual data. Adding onto it so it reflects what I’m running into. I’ve used all code suggestions and nothing is giving Macys 5, JC Penny 5.

  • Macys has a sum of 5. Invoices 1-5 are consecutive and contain duplicates.
  • JC Penny has a sum of 5. Invoices 7-11 are include consecutive of 2 days and contains duplicates.

**August 25:**Adding a new example with current code (with the help of @Corralien) and dataframe

d = {'Vendor ID':
         {0: 8555, 1: 8555, 2: 8555, 3: 8555, 4: 8555, 5: 8555, 6: 8555,7: 8555, 8: 8555,9: 8555, 10: 8555, 11: 8555, 12: 8555, 13: 8555,
                   14: 1044, 15: 1044, 16: 1044, 17: 1044, 18: 1044, 19: 1044, 20: 1044, 21: 1044},
 'Vendor Name':
         {0: 'Macys', 1: 'Macys', 2: 'Macys', 3: 'Macys', 4: 'Macys',5: 'Macys', 6: 'Macys', 7: 'Macys', 8: 'Macys', 9: 'Macys', 10: 'Macys', 11: 'Macys', 12: 'Macys', 13: 'Macys',
                  14: 'JC Penny', 15: 'JC Penny', 16: 'JC Penny', 17: 'JC Penny', 18: 'JC Penny', 19: 'JC Penny', 20: 'JC Penny',21: 'JC Penny'},
 'Invoice Date':
         {0: '7/31/2020', 1: '7/31/2020', 2: '7/31/2020', 3: '7/31/2020', 4: '7/31/2020', 5: '2/14/2020', 6: '12/31/2019',7: '12/23/2019', 8: '7/14/2020', 9: '6/30/2020' , 10: '6/30/2020',
                  11: '6/30/2020',  12: '6/30/2020', 13: '6/30/2020', 14: '3/30/2020', 15: '3/20/2020', 16: '3/20/2020', 17: '3/20/2020', 18: '3/20/2020', 19: '2/27/2020', 20: '2/26/2020', 21: '2/26/2020'},
 'Invoice Number':
         {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}

df = pd.DataFrame(d)

# fix time format
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'])

def count_consecutive_invoice(sr):
    # Find days that are less than or equal to 1 day (m1)
    m1 = sr.diff().le(pd.Timedelta(days=1))
    #  find first days of consecutive days (which is False in m1)
    m2 = (m1.eq(False) & m1.shift(-1).eq(True))
    return m1.sum() + m2.sum()

output = df.groupby(['Vendor ID', 'Vendor Name'])['Invoice Date'] 
      .apply(count_consecutive_invoice) 
      .rename('Count of Consecutive Invoices') 
      .reset_index()

Current Output:

   Vendor ID Vendor Name  Count of Consecutive Invoices
0       1044    JC Penny                              8
1       8555       Macys                             14

Expected Output

   Vendor ID Vendor Name  Count of Consecutive Invoices
0       1044    JC Penny                              7
1       8555       Macys                             10

Answer

For each group:

  • m1: find days are less than or equal to 1 day (m1)
  • m2: find first days of consecutive days (which is False in m1)

Then sum the 2 series.

def count_consecutive_invoice(sr):
    m1 = sr.sort_values().diff().le(pd.Timedelta(days=1))
    m2 = (m1.eq(False) & m1.shift(-1).eq(True))
    return m1.sum() + m2.sum()
# Ensure your 'Invoice Date' column is datetime64
# df['date'] = pd.to_datetime(df['date'])

>>> df.groupby(['Vendor ID', 'Vendor Name'])['Invoice Date'] 
      .apply(count_consecutive_invoice) 
      .rename('Count of Consecutive Invoices') 
      .reset_index()

   Vendor ID Vendor Name  Count of Consecutive Invoices
0       1044    JC Penny                              5
1       8555       Macys                              5

On your real data, the output is now:

   Vendor ID Vendor Name  Count of Consecutive Invoices
0       1044    JC Penny                              7
1       8555       Macys                             10