How to handle missing data in a pandas DataFrame?

I have a pandas dataframe containing the following information:

  • For each Timestamp, between 1 and 4 of the 8 available Trays are in use (so there are at most 4 Trays per Timestamp).
  • Each Tray consists of 4 positions.

A dataframe could look like this:

df = 

     timestamp    t_idx  position  error    type    SNR
 0   16229767       5        2       1       T1     123
 1   16229767       5        1       0       T1     123
 3   16229767       5        3       0       T1     123
 4   16229767       5        4       0       T1     123
 5   16229767       3        3       1       T9      38
 6   16229767       3        1       0       T9      38
 7   16229767       3        4       0       T9      38
 8   29767162       7        1       0       T4     991
 9   29767162       7        4       1       T4     991 
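
For reference, the example frame above can be reproduced with something like this (the original index skips 2, which is ignored here):

import pandas as pd

df = pd.DataFrame({
    'timestamp': [16229767] * 7 + [29767162] * 2,
    't_idx':     [5, 5, 5, 5, 3, 3, 3, 7, 7],
    'position':  [2, 1, 3, 4, 3, 1, 4, 1, 4],
    'error':     [1, 0, 0, 0, 1, 0, 0, 0, 1],
    'type':      ['T1'] * 4 + ['T9'] * 3 + ['T4'] * 2,
    'SNR':       [123] * 4 + [38] * 3 + [991] * 2,
})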

If we look at the timestamp “16229767”, there were 2 trays in use: Tray 3 and Tray 5. Every position for Tray 5 was detected. However, Tray 3 has missing data, as position 2 is missing.

I would like to fix that and add these rows programmatically:

 10  16229767       3        2       1       T9      38

 11  29767162       7        2       1       T4     991 
 12  29767162       7        3       1       T4     991 

I am not sure how to handle the missing values correctly. My naive approach right now is:

timestamps = df['timestamp'].unique()
for ts in timestamps:
    tray_ids = df.loc[df['timestamp'] == ts]['t_idx'].unique()
    for t_id in tray_ids:
        # For this timestamp and tray id: each position (1 to 4) should exist exactly once!
        # df.loc[(df['timestamp'] == ts) & (df['t_idx'] == t_id)]
        # if a position is missing, append it for this tray and set error to 1

How can I find the missing positions now and add the rows to my dataframe?

===

Edit: I simplified my example but left out a relevant piece of information: there are also other columns, and the newly generated rows should have the same content per tray. I made this clearer by adding two more columns.

Also, there was a question about the error column: for each row that has to be added, the error should automatically be 1 (there is no further logic behind it).

Answer

We can start by converting position to a categorical type, then use a groupby to generate the missing rows and set the corresponding error values to 1.
We also have to fill the type and SNR columns with the correct values, like so:

>>> import pandas as pd
>>> # make position categorical so the groupby below emits a row for every position
>>> df['position'] = pd.Categorical(df['position'], categories=df['position'].unique())
>>> df_grouped = df.groupby(['timestamp', 't_idx', 'position'], as_index=False).first()
>>> df_grouped['error'] = df_grouped['error'].fillna(1)

>>> # propagate each tray's type to the newly created rows
>>> df_grouped.sort_values('type', inplace=True)
>>> df_grouped['type'] = df_grouped.groupby(['timestamp','t_idx'])['type'].ffill().bfill()

>>> # same for SNR
>>> df_grouped.sort_values('SNR', inplace=True)
>>> df_grouped['SNR'] = df_grouped.groupby(['timestamp','t_idx'])['SNR'].ffill().bfill()

>>> df_grouped = df_grouped.reset_index(drop=True)
>>> df_grouped
    timestamp   t_idx   position    error   type    SNR
0   16229767    3       1           0.0     T9      38.0
1   16229767    3       3           1.0     T9      38.0
2   16229767    3       4           0.0     T9      38.0
3   16229767    5       2           1.0     T1      123.0
4   16229767    5       1           0.0     T1      123.0
5   16229767    5       3           0.0     T1      123.0
6   16229767    5       4           0.0     T1      123.0
7   29767162    7       1           0.0     T4      991.0
8   29767162    7       4           1.0     T4      991.0
9   16229767    3       2           1.0     T9      38.0
10  16229767    7       2           1.0     T4      991.0
11  16229767    7       1           1.0     T4      991.0
12  16229767    7       3           1.0     T4      991.0
13  16229767    7       4           1.0     T4      991.0
14  29767162    3       2           1.0     T4      991.0
15  29767162    3       1           1.0     T4      991.0
16  29767162    3       3           1.0     T4      991.0
17  29767162    3       4           1.0     T4      991.0
18  29767162    5       2           1.0     T4      991.0
19  29767162    5       1           1.0     T4      991.0
20  29767162    5       3           1.0     T4      991.0
21  29767162    5       4           1.0     T4      991.0
22  29767162    7       2           1.0     T4      991.0
23  29767162    7       3           1.0     T4      991.0

Then we filter on the (timestamp, t_idx) pairs present in the original DataFrame to get the expected result:

>>> df_grouped[
...     pd.Series(
...         list(zip(df_grouped['timestamp'].values, df_grouped['t_idx'].values))
...     ).isin(list(zip(df['timestamp'].values, df['t_idx'].values)))
... ].sort_values(by=['timestamp', 't_idx']).reset_index(drop=True)
    timestamp   t_idx   position    error   type    SNR
0   16229767    3       1           0.0     T9      38.0
1   16229767    3       3           1.0     T9      38.0
2   16229767    3       4           0.0     T9      38.0
3   16229767    3       2           1.0     T9      38.0
4   16229767    5       2           1.0     T1      123.0
5   16229767    5       1           0.0     T1      123.0
6   16229767    5       3           0.0     T1      123.0
7   16229767    5       4           0.0     T1      123.0
8   29767162    7       1           0.0     T4      991.0
9   29767162    7       4           1.0     T4      991.0
10  29767162    7       2           1.0     T4      991.0
11  29767162    7       3           1.0     T4      991.0
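
For completeness, a minimal alternative sketch of the same idea that avoids building the full Cartesian product in the first place: take only the (timestamp, t_idx) pairs that actually occur, cross them with positions 1 to 4, left-merge the original data onto that, and fill the gaps. This assumes positions always run from 1 to 4, that df is the original frame (before position was converted to categorical), and pandas >= 1.2 for merge(how='cross'):

import pandas as pd

# every (timestamp, t_idx) pair that actually occurs, crossed with positions 1-4
full = (
    df[['timestamp', 't_idx']].drop_duplicates()
    .merge(pd.DataFrame({'position': [1, 2, 3, 4]}), how='cross')
)

# left-merge the original data; missing positions become NaN rows
result = full.merge(df, on=['timestamp', 't_idx', 'position'], how='left')

# added rows get error = 1; type and SNR are copied from the tray's existing rows
result['error'] = result['error'].fillna(1).astype(int)
result[['type', 'SNR']] = (
    result.groupby(['timestamp', 't_idx'])[['type', 'SNR']]
    .transform(lambda s: s.ffill().bfill())
)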