Create sklearn pipeline with column operations step

I would like to know how to add a step to a sklearn pipeline that multiplies the values of two columns and drops the original columns.

This is what I'm currently doing:

  • After loading the Dataframe, I multiply the target columns and delete them.
  • Prepare X, Y, training set and test set.
  • Configure pipeline with StandardScaler and some ML method (for example Linear Regression)
  • Fit and predict.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


# df is a pandas dataframe with columns A, B, C, Y
df['BC']=df['B']*df['C']
df.drop(columns=['B','C'], inplace=True)

X = df.loc[:,['A','BC']]
Y = df['Y']

x_train, x_test, y_train, y_test = train_test_split(X,Y,train_size=0.8)

pipe = Pipeline([
    ('minmax',StandardScaler()),
    ('linear',LinearRegression())
])

pipe.fit(x_train,y_train)
y_pred = pipe.predict(x_test)

With this approach, when I want to predict on new data I have to compute the product myself before calling the pipeline. For example, for A=1, B=3, C=4:

print(pipe.predict(np.array([[1,12]])))

Instead, I would like to pass the raw values directly:

print(pipe.predict(np.array([[1,3,4]])))

In other words, I want to modify the pipeline to something like this:

pipe = Pipeline([
    ('product', CustomFunction(columns_to_multiply, result_name_column)),
    ('minmax',StandardScaler()),
    ('linear',LinearRegression())
])

Is this possible with scikit-learn or with custom functions? If so, how?

Answer

I am unable to fully test your code because the data is missing. However, you may be able to adapt FunctionTransformer as follows:

Code:

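import numpy as np
from sklearn.preprocessing import FunctionTransformer

def CustomMultiplier(arrs):
    # Keep the first column (A) unchanged
    a = arrs[:,0]
    # Multiply the remaining columns (B, C) row-wise
    b = np.prod(arrs[:,1:], axis=1)
    return np.column_stack((a, b))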

if __name__ == '__main__':
    transformer = FunctionTransformer(CustomMultiplier)
    X = np.array([[1,3,4], [2,4,5]])
    result = transformer.transform(X)
    print(result)

Result:

[[ 1 12]
 [ 2 20]]
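
To get the behaviour asked for in the question, the same kind of transformer can be placed as the first step of the pipeline. The following is only a minimal sketch under some assumptions: the data is synthetic and made up for illustration (columns A, B, C with Y = 2*A + 3*B*C), the function name multiply_trailing_columns and the step name 'scaler' are hypothetical, and the input is assumed to be a NumPy array with A in the first column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def multiply_trailing_columns(arrs):
    """Keep column 0 and replace the remaining columns with their row-wise product."""
    arrs = np.asarray(arrs)
    first = arrs[:, 0]
    product = np.prod(arrs[:, 1:], axis=1)
    return np.column_stack((first, product))


# Synthetic data, made up for illustration: columns are A, B, C
# and the target is Y = 2*A + 3*B*C.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(50, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] * X[:, 2]

pipe = Pipeline([
    ('product', FunctionTransformer(multiply_trailing_columns)),
    ('scaler', StandardScaler()),
    ('linear', LinearRegression()),
])
pipe.fit(X, y)

# Raw A, B, C values go in; the multiplication happens inside the pipeline.
prediction = pipe.predict(np.array([[1, 3, 4]]))
print(prediction)
```

Because the synthetic target is exactly linear in A and B*C, the prediction for A=1, B=3, C=4 should come out close to 38. With real data you would fit on the training split of the untransformed columns A, B, C, so that the product is computed inside the pipeline both at fit time and at predict time.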