How can I insert a step into a sklearn pipeline that multiplies the values of two columns and deletes the original columns?
Currently I'm doing something like this:
- After loading the DataFrame, I multiply the target columns and delete the originals.
- Prepare X, Y, training set and test set.
- Configure a pipeline with StandardScaler and some ML method (for example, LinearRegression).
- Fit and predict.
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    # df is a pandas DataFrame with columns A, B, C, Y
    df['BC'] = df['B'] * df['C']
    df.drop(columns=['B', 'C'], inplace=True)

    X = df.loc[:, ['A', 'BC']]
    Y = df['Y']
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8)

    pipe = Pipeline([
        ('minmax', StandardScaler()),
        ('linear', LinearRegression())
    ])
    pipe.fit(x_train, y_train)
    y_pred = pipe.predict(x_test)
With this approach, when I want to predict on new data I have to do the multiplication myself first. For example, for A=1, B=3, C=4 I have to pass A=1 and BC=12.
I would prefer to pass the raw values of A, B and C and let the pipeline do the multiplication. That is, I want to modify the pipeline to something like:
    pipe = Pipeline([
        ('product', CustomFunction(columns_to_multiply, result_name_column)),
        ('minmax', StandardScaler()),
        ('linear', LinearRegression())
    ])
Is it possible with scikit-learn or custom functions? How?
I am unable to fully test your code because the data is missing. However, you may be able to adapt FunctionTransformer as follows:
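    import numpy as np
    from sklearn.preprocessing import FunctionTransformer

    def CustomMultiplier(arrs):
        # Keep the first column and replace the rest with their row-wise product
        a = arrs[:, 0]
        b = np.prod(arrs[:, 1:], axis=1)
        return np.column_stack((a, b))

    if __name__ == '__main__':
        transformer = FunctionTransformer(CustomMultiplier)
        X = np.array([[1, 3, 4], [2, 4, 5]])
        result = transformer.transform(X)
        print(result)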
    [[ 1 12]
     [ 2 20]]
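To connect this back to the question, the transformer can be placed as the first step of the pipeline, so that new data is passed with the raw A, B, C values. The following is only a sketch under some assumptions: the data is synthetic (a made-up target Y = 2*A + 3*B*C), the columns are assumed to arrive in the order A, B, C, and the names `multiply_trailing_columns` and `scale` are illustrative, not from the original code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def multiply_trailing_columns(arrs):
    # Keep column 0 (A) and replace the remaining columns (B, C) with their product.
    arrs = np.asarray(arrs)
    return np.column_stack((arrs[:, 0], np.prod(arrs[:, 1:], axis=1)))


# Synthetic stand-in for the real data (assumption): columns A, B, C
# and a target Y = 2*A + 3*B*C.
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(50, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] * X[:, 2]

pipe = Pipeline([
    ('product', FunctionTransformer(multiply_trailing_columns)),
    ('scale', StandardScaler()),
    ('linear', LinearRegression()),
])
pipe.fit(X, y)

# New data is passed with the raw A, B, C values;
# the pipeline performs the multiplication itself.
y_new = pipe.predict(np.array([[1, 3, 4]]))
print(y_new)
```

Because the synthetic target is exactly linear in A and B*C, the prediction for A=1, B=3, C=4 should come out very close to 38. Note that this version selects columns by position; if you need to select them by name from a DataFrame, a small custom transformer class would be the more natural fit.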