“setting with enlargement” behavior in Pandas

Could you help me understand the following.

df1 is a dataframe indexed by date. df2 is another dataframe also indexed by date, but df1 and df2 only have one date in common, 2018-12-31. See the following output:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=pd.to_datetime(["2018-12-31", "2019-12-31", "2020-12-31"]))
df2 = pd.DataFrame({'XX': [700, 800]}, index=pd.to_datetime(["2016-12-31", "2018-12-31"]))
df1['BB'] = df2['XX']
print(df1)

            A     BB
2018-12-31  1  800.0
2019-12-31  2    NaN
2020-12-31  3    NaN

I realize that this works, even though the number of rows is not the same on the left and right hand side of df1['BB'] = df2['XX'].

Is this an abbreviation for a more complex expression? The result keeps all the rows in df1 but doesn’t expand the index to include the rows in df2, so is this operation like a “left merge” (“left join”)? Will it ever include the union of df1's and df2's indexes, filling with NaNs appropriately?

I am using pandas version 1.2.5

Answer

Setting with enlargement is handled by DataFrame.__setitem__, which is called when performing operations like:

df['col'] = someSeries

In this case the private method _set_item is called, which is documented as “Add series to DataFrame in specified column.”

Since a DataFrame is indexed, and every column it contains shares that same index, a new Series must be aligned to the index of the DataFrame it is being added to.

The _sanitize_column function makes sure the column is compatible. In this case, a Series.reindex operation is needed for compatibility (index alignment). This happens in _reindex_for_setitem. These are private helpers, so their names and layout can differ between pandas versions.
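As a rough illustration of the alignment step, the sketch below calls the public Series.reindex directly on the question's data. The variable name aligned is only for illustration; this is not the internal code path itself.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]},
                   index=pd.to_datetime(["2018-12-31", "2019-12-31", "2020-12-31"]))
df2 = pd.DataFrame({'XX': [700, 800]},
                   index=pd.to_datetime(["2016-12-31", "2018-12-31"]))

# Conform df2['XX'] to df1's index: labels found only in df2 are dropped,
# and labels found only in df1 are filled with NaN.
aligned = df2['XX'].reindex(df1.index)

print(aligned)
print(aligned.dtype)
```

The integer column becomes float64 after reindexing because NaN cannot be stored in an int64 column, which is why BB shows 800.0 rather than 800.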


So, in answer to the questions:

“Is this an abbreviation for a more complex expression?”

Yes, the resolved expression is:

df1['BB'] = df2['XX'].reindex(df1.index)._values
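One way to check this equivalence, sketched with the question's data: assign the column both ways on copies of df1 and compare. The sketch uses the public .to_numpy() in place of the private ._values attribute, on the assumption that they yield the same array here; via_setitem and explicit are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]},
                   index=pd.to_datetime(["2018-12-31", "2019-12-31", "2020-12-31"]))
df2 = pd.DataFrame({'XX': [700, 800]},
                   index=pd.to_datetime(["2016-12-31", "2018-12-31"]))

# Plain column assignment: pandas aligns df2['XX'] to df1.index implicitly.
via_setitem = df1.copy()
via_setitem['BB'] = df2['XX']

# The same alignment written out by hand with public API only.
explicit = df1.copy()
explicit['BB'] = df2['XX'].reindex(df1.index).to_numpy()

# DataFrame.equals treats NaNs in the same positions as equal.
print(via_setitem.equals(explicit))
```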

“Is this operation like a “left merge” (“left join”)?”

Yes, in effect. The operation is similar to a “left merge” because the Series is reindexed to match df1, so the resulting column contains exactly the keys in df1.

In this specific example, join produces the exact same result:

df1 = df1.join(df2.rename(columns={'XX': 'BB'}))
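A small check of that claim on the question's data, assuming the default how='left' of DataFrame.join. The names assigned and joined are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]},
                   index=pd.to_datetime(["2018-12-31", "2019-12-31", "2020-12-31"]))
df2 = pd.DataFrame({'XX': [700, 800]},
                   index=pd.to_datetime(["2016-12-31", "2018-12-31"]))

# Column assignment with implicit index alignment.
assigned = df1.copy()
assigned['BB'] = df2['XX']

# Left join on the index (join defaults to how='left').
joined = df1.join(df2.rename(columns={'XX': 'BB'}))

# Both frames keep df1's rows only and hold the same values.
print(assigned.equals(joined))
```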

“Will it ever include the union of df1’s and df2’s indexes, filling with NaNs appropriately?”

No, the result only ever contains the index labels of the DataFrame being assigned to. Labels that appear only in df2 (here 2016-12-31) are dropped.
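If the union of both indexes is what you want, one option (a sketch, not the only approach) is an outer join instead of column assignment. The name outer is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]},
                   index=pd.to_datetime(["2018-12-31", "2019-12-31", "2020-12-31"]))
df2 = pd.DataFrame({'XX': [700, 800]},
                   index=pd.to_datetime(["2016-12-31", "2018-12-31"]))

# how='outer' keeps every label from either index and fills the gaps with NaN.
outer = df1.join(df2.rename(columns={'XX': 'BB'}), how='outer')

print(outer)
```

Here 2016-12-31 appears in the result with NaN in column A, and A is upcast to float64 to hold that NaN.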