How to build a dataframe with clustered timestamps as index from a generator of tuple(key, dict)?

I’m new to Pandas, so maybe I’m missing something very simple here, but searching through other questions didn’t get me what I need.

I have a Python generator that yields tuples of `(timestamp, {k1: v1, k2: v2, ...})`, where the timestamp is a float, and I want to build a dataframe of this form:

datetime(timestamp) (<-- this should be the index) | k1 | k2 | k3 |...

The second request (which might actually help in terms of efficiency) is to have lines with very close timestamps (less than 0.3 s apart) merged into a single line (it is promised that the columns will not overlap, i.e. for every column, at most one of the merged lines will have a non-NaN value).

The following lines did it for me, but only as a time series, not as the index of a dataframe, and I don’t know how to “stick it back” into the dataframe:

    # mark timestamps within 0.3 s of the previous one as NaN
    # (the misaligned right-hand side effectively assigns NaN),
    # then forward-fill so each cluster shares its first timestamp
    times.loc[times.diff() < 0.3] = times[times.diff() > 0.3]
    times = times.pad().map(datetime.fromtimestamp)

The size of the data can get to thousands of (clusters of) timestamps over a million columns.

This option was the fastest for me:

    t = {}
    for ts, d in file_content:
        for k, v in d.items():
            t.setdefault(ts, {})[k] = v
    df1 = pd.DataFrame.from_dict(t, orient='index')

Loading into the dict took 14 s, and loading the dict into the dataframe took 30 s (the output dataframe is ~1 GB), but this is without any optimization for the timestamp clustering.

What’s the best way to load the dataframe, and what’s the code that can build and “attach” the timestamp index to this dataframe?

EDIT: here’s an example of the first tuple from file_content:

In [2]: next(file_content)
Out[2]:
(1628463575.9415462,
 {'E2_S0_ME_rbw': 0,
  'E2_S0_ME_rio': 0,
  'E2_S0_ME_rlat': 0,
  'E2_S0_ME_rmdi': 0,
  'E2_S0_ME_wbw': 0,
  'E2_S0_ME_wio': 0,
  'E2_S0_ME_wlat': 0,
  'E2_S0_ME_wmdi': 0})

EDIT2: the second tuple (note that the timestamp is VERY close to the previous one, AND that the keys are completely different):

In [12]: next(file_content)
Out[12]:
(1628463575.946525,
 {'E2_S1_ME_errors': 0,
  'E2_S1_ME_messages': 0})

Answer

You discovered that you can use a dictionary to load your data; that can be written slightly more simply:

>>> pd.DataFrame.from_dict(dict(file_contents), orient='index')
              E2_S0_ME_rbw  E2_S0_ME_rio  E2_S0_ME_rlat  E2_S0_ME_rmdi  E2_S0_ME_wbw  E2_S0_ME_wio  E2_S0_ME_wlat  E2_S0_ME_wmdi
1.628464e+09             0             0              0              0             0             0              0              0

You can also directly load the iterable into a dataframe and then normalize from there:

>>> fc = pd.DataFrame(file_contents)
>>> fc
              0                                                  1
0  1.628464e+09  {'E2_S0_ME_rbw': 0, 'E2_S0_ME_rio': 0, 'E2_S0_...'
>>> df = pd.json_normalize(fc[1]).join(fc[0].rename('timestamp'))
>>> df
   E2_S0_ME_rbw  E2_S0_ME_rio  E2_S0_ME_rlat  E2_S0_ME_rmdi  E2_S0_ME_wbw  E2_S0_ME_wio  E2_S0_ME_wlat  E2_S0_ME_wmdi     timestamp
0             0             0              0              0             0             0              0              0  1.628464e+09
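To also turn the timestamps into the index, as asked, one option is `pd.to_datetime` with `unit='s'`, which converts float epoch seconds. A minimal sketch on a frame shaped like the one above (values assumed):

```python
import pandas as pd

# minimal frame shaped like the normalized output above (assumed values)
df = pd.DataFrame({"timestamp": [1628463575.9415462], "E2_S0_ME_rbw": [0]})

# convert float epoch seconds to datetimes and move them into the index
df = df.set_index(pd.to_datetime(df.pop("timestamp"), unit="s"))
```

`df.pop` removes the helper column while handing it to `set_index`, so the timestamp appears only once, as the index.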

Now, for coalescing lines, let’s start with a dataframe that has values as you describe. Here there are two groups, one of rows 0–3 and the other of rows 4–5, with at most one non-NaN value per column within each group:

>>> df
      timestamp  E2_S0_ME_rbw  E2_S0_ME_rio  E2_S0_ME_rlat  E2_S0_ME_rmdi  E2_S0_ME_wbw  E2_S0_ME_wio  E2_S0_ME_wlat  E2_S0_ME_wmdi
0  1.628464e+09           NaN           NaN            NaN       0.886793      0.525714           NaN            NaN            NaN
1  1.628464e+09           NaN      0.638154       0.319839            NaN           NaN      0.375288            NaN            NaN
2  1.628464e+09           NaN           NaN            NaN            NaN           NaN           NaN       0.660108            NaN
3  1.628464e+09      0.969127           NaN            NaN            NaN           NaN           NaN            NaN       0.362666
4  1.628464e+09           NaN           NaN            NaN       0.879372           NaN           NaN       0.851226            NaN
5  1.628464e+09      0.029188      0.757706       0.718359            NaN      0.491337      0.239511            NaN       0.503021
>>> df['timestamp'].astype('datetime64[s]')
0   2021-08-08 22:59:35
1   2021-08-08 22:59:36
2   2021-08-08 22:59:36
3   2021-08-08 22:59:36
4   2021-08-08 22:59:36
5   2021-08-08 22:59:37
Name: timestamp, dtype: datetime64[ns]
>>> df['timestamp'].diff()
0    NaN
1    0.2
2    0.2
3    0.2
4    0.4
5    0.2
Name: timestamp, dtype: float64

You want to merge all lines that are within 0.3 s of each other, which we can check with `diff()`: we start a new group every time a diff is greater than 0.3 s. `.first()` then takes the first non-NA value in each column of a group:

>>> df.groupby((df['timestamp'].diff().rename(None) > .3).cumsum()).first()
      timestamp  E2_S0_ME_rbw  E2_S0_ME_rio  E2_S0_ME_rlat  E2_S0_ME_rmdi  E2_S0_ME_wbw  E2_S0_ME_wio  E2_S0_ME_wlat  E2_S0_ME_wmdi
0  1.628464e+09      0.969127      0.638154       0.319839       0.886793      0.525714      0.375288       0.660108       0.362666
1  1.628464e+09      0.029188      0.757706       0.718359       0.879372      0.491337      0.239511       0.851226       0.503021
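Putting the coalescing together with the datetime index the question asks for, here is a sketch on small synthetic data (the column names `a`/`b` and the values are made up; the grouping logic is the same `diff()`/`cumsum()` trick as above):

```python
import numpy as np
import pandas as pd

# synthetic data: two clusters of timestamps, complementary NaN patterns
df = pd.DataFrame({
    "timestamp": [1628463575.94, 1628463575.95, 1628463576.50],
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, np.nan],
})

# new group whenever the gap to the previous timestamp exceeds 0.3 s
groups = (df["timestamp"].diff().rename(None) > 0.3).cumsum()

# first non-NaN per column in each group, then datetime-ify the index
out = df.groupby(groups).first()
out = out.set_index(pd.to_datetime(out.pop("timestamp"), unit="s"))
```

Each coalesced row keeps the first timestamp of its cluster, converted to a datetime, as its index label.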

Note that `.resample()`, by contrast, would split values that are close but fall on opposite sides of a bin boundary, e.g. 0.299 s and 0.301 s, into different lines.
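To illustrate that boundary effect with a toy example (the values and the 300 ms bin size are made up): resample bins start at fixed boundaries, unlike the `diff()`-based grouping.

```python
import pandas as pd

# two samples only 2 ms apart, but straddling a 300 ms bin boundary
s = pd.Series([1.0, 2.0], index=pd.to_datetime([0.299, 0.301], unit="s"))

# resample puts them in different bins, while the diff()-based grouping
# would have merged them (their gap is well under 0.3 s)
binned = s.resample("300ms").first().dropna()  # two rows, not one
```

This is why the `diff()`-based grouping fits the “cluster” requirement better than fixed-width resampling.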