How to Optimize Performance in Nested Loops Iterating a Pandas Dataframe

Suppose the below dataframe df:

import pandas as pd
data = {"Time":["2021-01-10 21:00:00", "2021-01-10 22:00:00", 
                "2021-01-10 21:30:01", "2021-01-10 21:45:00",
                "2021-01-12 09:00:00", "2021-01-12 09:30:00"],
        "ID":["1","1","2","2","2","2"],
        "Event":["cut","cut", "smooth","smooth","cut","cut"],
        "Status":["start", "complete", "start", "complete","start", "complete",]}
df = pd.DataFrame(data)  
df["Time"] = pd.to_datetime(df["Time"])  
df["ID"] = df["ID"].astype("int")  
df

enter image description here

My final goal is to calculate the total production time per unique ‘ID’, without taking into account any potential time breaks between each time interval. The start time for each ID is the 1st instance of “start” Status, and the end production time is the last instance of “complete” Status per ID. E.g., for ID==1 this is 1h (3600s), while for ID==2 is about 45min (15min in the 1st, and 30min in the 2nd time interval).

Since I would also be interested in capturing the time intervals per unique ID (e.g., ID==1 has only 1 interval which coincides with its total production time, ID==2 has 2 pairs of start-complete statuses, and hence 2 intervals), what I thought to do is create two dictionaries: ‘time_diff_dict’, and ‘cumulativeSumId’:

  • ‘time_diff_dict’: key:unique ID, values: the time intervals
  • ‘cumulativeSumId’: key: unique ID, values: the cumulative sum of the time intervals above

In this way, in the ‘cumulativeSumId’ dictionary, the last key value per each key (ID) would be equal to its total production time.

However, imagine that the real df has about 180,000 rows with about 3000 unique IDs, and it takes about 10min to terminate the below code. Probably I will have to use iterations methods like the ones described here or more efficient nested loops, however, I need some help implementing a better performance for this particular case. My current code is:

# dictionary containing the time intervals of IDs (delta of complete-start)
# key: ID, value: time interval for every unique complete-start completion per unique ID
time_diff_dict = {key: [] for key in list(df.ID.unique())}

# same structure as time_diff_dict, but here storing the time_diff_dict values per ID
cumulativeSumId = {key: [] for key in list(df.ID.unique())} 

# initialise time difference to 0 before calculating time interval
time_diff = 0

for unique_ID in df.ID.unique():
  for row in df.itertuples(index=True, name="Pandas"):
    if row.ID == unique_ID:
      if row.Status == "start":
        start = row.Time
        cumulativeSumId[unique_ID].append(sum(time_diff_dict[unique_ID]))
      elif row.Status == "complete":
        end = row.Time
        delta = end - start
        time_diff = delta.total_seconds()
        time_diff_dict[unique_ID].append(time_diff)
        cumulativeSumId[unique_ID].append(sum(time_diff_dict[unique_ID]))

dict_values = list(cumulativeSumId.values())
df["Cumulative_Time"] = [item for sublist in dict_values for item in sublist]

The result of df now:

enter image description here

, where e.g. for ID==1, the total production time is 3600sec, and for ID==2 is 2699sec because this is the last instance in its cumulative sum time dictionary.

After that, I create a new df with: unique IDs, “totalTimeId”, and “timeIntervals”:

''' 
* create list of lists 
* every sublist is a dataframe per unique ID
'''
lists_of_IDdfs =[]

for id, df_id in df.groupby("ID"):
  lists_of_IDdfs.append(df_id)

data = []
for df in range(len(lists_of_IDdfs)):
  data.append((lists_of_IDdfs[df].ID.iloc[-1], lists_of_IDdfs[df].Cumulative_Time.iloc[-1]))
df_ID_TotalTime = pd.DataFrame(data, columns= ["ID", "totalTimeId"])

'''add the respective time interval data points per unique ID'''
df_ID_TotalTime["timeIntervals"] = df_ID_TotalTime["ID"].map(time_diff_dict)
df_ID_TotalTime

Final desired result:

enter image description here

I would appreciate any thoughts and help! Thank you!

Answer

You can reshape your dataframe using pivot, compute the difference between the two datetimes and groupby “ID” to aggregate the data:

# pre-requisite ensure that Time is of datetime type
df['Time'] = pd.to_datetime(df['Time'])

(df.pivot(index=['ID', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: d['complete']-d['start'])
   .groupby('ID')['time'].sum()
)

output:

ID
1   0 days 00:30:00
2   0 days 00:24:58

To get the output in seconds:

(df.pivot(index=['ID', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: d['complete']-d['start'])
   .groupby('ID')['time'].sum()
   .dt.total_seconds()
)

output:

ID
1    1800.0
2    1498.0

alternative output:

(df.pivot(index=['ID', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: (d['complete']-d['start']).dt.total_seconds())
   .groupby('ID')['time'].agg(totalTimeId='sum', timeIntervals=list)
)

output:

    totalTimeId    timeIntervals
ID                              
1        3600.0         [3600.0]
2        2699.0  [1800.0, 899.0]

edit how to handle duplicates:

You need to add a secondary index that is unique (ID2)

(df.assign(ID2=df.groupby(['ID', 'Event', 'Status']).cumcount())
   .pivot(index=['ID', 'ID2', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: (d['complete']-d['start']).dt.total_seconds())
   .groupby('ID')['time'].agg(totalTimeId='sum', timeIntervals=list)
)

input:

                 Time  ID   Event    Status
0 2021-01-10 21:00:00   1     cut     start
1 2021-01-10 22:00:00   1     cut  complete
2 2021-01-10 21:30:01   2  smooth     start
3 2021-01-10 21:45:00   2  smooth  complete
4 2021-01-12 09:00:00   2     cut     start
5 2021-01-12 09:30:00   2     cut  complete
6 2021-01-12 09:30:00   2     cut     start
7 2021-01-12 09:35:00   2     cut  complete

intermediate:

Status                   complete               start
ID ID2 Event                                         
1  0   cut    2021-01-10 22:00:00 2021-01-10 21:00:00
2  0   cut    2021-01-12 09:30:00 2021-01-12 09:00:00
       smooth 2021-01-10 21:45:00 2021-01-10 21:30:01
   1   cut    2021-01-12 09:35:00 2021-01-12 09:30:00

output:

    totalTimeId           timeIntervals
ID                                     
1        3600.0                [3600.0]
2        2999.0  [1800.0, 899.0, 300.0]