Efficient way to merge large Pandas dataframes between two dates

I know there are many questions like this one but I can’t seem to find the relevant answer. Let’s say I have 2 data frames as follow:

df1 = pd.DataFrame(
    {
        "end": [
            "2019-08-31",
            "2019-08-28",
            "2019-09-09",
            "2019-09-08",
            "2019-09-14",
            "2019-09-14",
        ],
        "start": [
            "2019-08-27",
            "2019-08-22",
            "2019-08-04",
            "2019-09-02",
            "2019-09-06",
            "2019-09-10",
        ],
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)

df2 = pd.DataFrame(
    {
        "timestamp": [
            "2019-08-30 10:00",
            "2019-08-28 10:00",
            "2019-08-27 10:30",
            "2019-08-07 12:00",
            "2019-09-12 10:00",
            "2019-09-11 14:00",
            "2019-08-29 18:00",
        ],
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)

df1["end"] = pd.to_datetime(df1["end"])
df1["start"] = pd.to_datetime(df1["start"])

df2["timestamp"] = pd.to_datetime(df2["timestamp"])

df1.sort_values(by=["end"], inplace=True)
df2.sort_values(by="timestamp", inplace=True)

Resulted as:

 end       start    id
0  2019-08-31  2019-08-27  1234
1  2019-08-28  2019-08-22  8679
2  2019-09-09  2019-08-04  8679
3  2019-09-08  2019-09-02  1234
4  2019-09-14  2019-09-06  1234
5  2019-09-14  2019-09-10  8679

 timestamp    id   val
0  2019-08-30 10:00  1234  AAAB
1  2019-08-28 10:00  1234  ABBA
2  2019-08-27 10:30  8679  CXXC
3  2019-08-07 12:00  1234  BBAA
4  2019-09-12 10:00  8679  XCXC
5  2019-09-11 14:00  8679  CCXX
6  2019-08-29 18:00  1234  BAAB

The classic way to merge by ID so timestamp will be between start and end in df1 is by merge on id or dummy variable and filter:

merged_df = pd.merge(df1, df2, how="left", on="id")
merged_df = merged_df.loc[
    (merged_df["timestamp"] >= merged_df["start"])
    & (merged_df["timestamp"] <= merged_df["end"])
]

In which I get the output I wish to have:

           end       start    id         timestamp   val
0   2019-08-31  2019-08-27  1234  2019-08-30 10:00  AAAB
1   2019-08-31  2019-08-27  1234  2019-08-28 10:00  ABBA
3   2019-08-31  2019-08-27  1234  2019-08-29 18:00  BAAB
4   2019-08-28  2019-08-22  8679  2019-08-27 10:30  CXXC
7   2019-09-09  2019-08-04  8679  2019-08-27 10:30  CXXC
19  2019-09-14  2019-09-10  8679  2019-09-12 10:00  XCXC
20  2019-09-14  2019-09-10  8679  2019-09-11 14:00  CCXX

My question: I need to do the same merge and get the same results but df1 is 200K rows and df2 is 600K.

What I have tried so far:

  • The classic way of merge and filter, as above, will fail because the initial merge will create a huge data frame that will overload the memory.

  • I also tried the pandasql approach which ended with my 16GB RAM PC
    getting stuck.

  • I tried the merge_asof in 3 steps of left join, right join and outer join as
    explained here but I run some tests and it seems to always
    return up to 2 records from df2 to a single line in df1.

Any good advice will be appreciated!

Answer

I’ve been working with niv-dudovitch and david-arenburg on this one, and here are our findings which I hope will be helpful to some of you out there… The core idea was to prevent growing objects in memory by creating a list of dataframes based on subsets of the data.

First version without multi-processing.

import pandas as pd

unk = df1.id.unique()
j = [None] * len(unk)
k = 0

df1.set_index('id', inplace = True)
df2.set_index('id', inplace = True)

for i in unk:
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how='left')  
    j[k] = tmp.loc[tmp['timestamp'].between(tmp['start'], tmp['end'])]
    k += 1
    
res = pd.concat(j)
res

Using Multi-Process

In our real case, we have 2 large data frame df2 is about 3 million rows and df1 is slightly above 110K. The output is about 20M rows.

import multiprocessing as mp
import itertools
import concurrent
from concurrent.futures import ProcessPoolExecutor
import time
import pandas as pd
from itertools import repeat


def get_val_between(ids, df1, df2):
    """
    Locate all values between 2 dates by id
    Args:
        - ids (list): list of ids

    Returns:
        - concat list of dataframes
    """

    j = [None] * len(ids)
    k = 0

    for i in ids:
        tmp = df1.loc[df1.index.isin([i])].join(
            df2.loc[df2.index.isin([i])], how="left"
        )
        tmp = tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])]

        # add to list in location k

        j[k] = tmp
        k += 1
    # keep only not None dfs in j

    j = [i for i in j if i is not None]
    if len(j) > 0:
        return pd.concat(j)
    else:
        return None


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""

    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)


def main():

    df1.reset_index(inplace=True, drop=True)
    df2.reset_index(inplace=True, drop=True)
    id_lst = df1.id.unique()
    iter_ids = grouper(10, list(id_lst))

    df1.set_index("id", inplace=True)
    df2.set_index("id", inplace=True)

    # set multi-processes

    executor = concurrent.futures.ProcessPoolExecutor(20)
    result_futures = executor.map(get_val_between, iter_ids, repeat(df1), repeat(df2))
    concurrent.futures.as_completed(result_futures)
    result_concat = pd.concat(result_futures)

    print(result_concat)


if __name__ == "__main__":
    main()

results as expected:

           end      start           timestamp   val
id                                                  
8679 2019-08-28 2019-08-22 2019-08-27 10:30:00  CXXC
8679 2019-09-09 2019-08-04 2019-08-27 10:30:00  CXXC
8679 2019-09-14 2019-09-10 2019-09-11 14:00:00  CCXX
8679 2019-09-14 2019-09-10 2019-09-12 10:00:00  XCXC
1234 2019-08-31 2019-08-27 2019-08-28 10:00:00  ABBA
1234 2019-08-31 2019-08-27 2019-08-29 18:00:00  BAAB
1234 2019-08-31 2019-08-27 2019-08-30 10:00:00  AAAB

As a benchmark with an output of 20 million rows, the Multi-Process approach is x10 times faster.