I’m having trouble applying different transformers to columns of different types (text vs. numerical) and combining them into a single transformer for later use.
I tried to follow the steps in the documentation for Column Transformer with Mixed Types, which explains how to do that for a mix of categorical and numerical data, but it doesn’t seem to work with text data.
TL;DR
How do you create a storable transformer that follows different pipelines for text and numerical data?
Data download and preparation
    # imports
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_openml
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import StandardScaler

    np.random.seed(0)

    # download Titanic data
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

    # data preparation
    numeric_features = ['age', 'fare']
    text_features = ['name', 'cabin', 'home.dest']
    X.fillna({text_col: '' for text_col in text_features}, inplace=True)

    # train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Transforming numerical features: ok
Following the steps in the link above, one can create a transformer for the numerical features as follows:
    # handling missing data and normalization
    numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                          ('scaler', StandardScaler())])
    num_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])

    # this works
    num_preprocessor.fit(X_train)
    train_feature_set = num_preprocessor.transform(X_train)
    test_feature_set = num_preprocessor.transform(X_test)

    # verify shape = (number of data points, number of numerical features (2))
    train_feature_set.shape  # (1047, 2)
    test_feature_set.shape   # (262, 2)
Transforming text features: ok
To process text features, I vectorize each text column with TF-IDF (as opposed to concatenating all text columns, and applying TF-IDF just once):
    # Tfidf of max 30 features
    text_transformer = TfidfVectorizer(use_idf=True, max_features=30)

    # apply separately to each column
    text_transformer_list = [(x + '_vectorizer', text_transformer, x) for x in text_features]
    text_preprocessor = ColumnTransformer(transformers=text_transformer_list)

    # this works
    text_preprocessor.fit(X_train)
    train_feature_set = text_preprocessor.transform(X_train)
    test_feature_set = text_preprocessor.transform(X_test)

    # verify shape = (number of data points, number of text features (3) times max_features (30))
    train_feature_set.shape  # (1047, 90)
    test_feature_set.shape   # (262, 90)
How do you do both at once?
I’ve tried several strategies to combine the two procedures above into a single transformer, but each fails with a different error.
Attempt 1: Follow documented strategy
Following the documentation (Column Transformer with Mixed Types) doesn’t work once text data replaces categorical data:
    # documented strategy
    sum_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                       ('text', text_transformer, text_features)])

    # fails
    sum_preprocessor.fit(X_train)
This raises the following error:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1047 and the array at index 1 has size 3
Attempt 2: FeatureUnion on the lists of transformers
    # create a list of numerical transformers, like those for text
    numerical_transformer_list = [(x + '_scaler', numeric_transformer, x) for x in numeric_features]

    # fails
    column_trans = FeatureUnion([text_transformer_list, numerical_transformer_list])
This raises the following error:
TypeError: All estimators should implement fit and transform. '('cabin_vectorizer', TfidfVectorizer(max_features=30), 'cabin')' (type <class 'tuple'>) doesn't
Attempt 3: ColumnTransformer on the lists of transformers
    # create a list of all transformers, text and numerical
    sum_transformer_list = text_transformer_list + numerical_transformer_list

    # works
    sum_preprocessor = ColumnTransformer(transformers=sum_transformer_list)

    # fails
    sum_preprocessor.fit(X_train)
This raises the following error:
ValueError: Expected 2D array, got 1D array instead: array=[54. nan nan ... 20. nan nan]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
My question
How do I create a single object that can fit and transform data mixing text and numerical types?
Answer
Short answer:
    all_transformers = text_transformer_list + [('num', numeric_transformer, numeric_features)]
    all_preprocessor = ColumnTransformer(transformers=all_transformers)

    all_preprocessor.fit(X_train)
    train_all = all_preprocessor.transform(X_train)
    test_all = all_preprocessor.transform(X_test)

    print(train_all.shape, test_all.shape)
    # prints (1047, 92) (262, 92)
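Since the question asks for a transformer that can be stored for later use, here is a minimal sketch of a save-and-restore round trip. It assumes the standard-library `pickle` module is acceptable for persistence, and it uses a small invented DataFrame (`toy`) in place of the Titanic data so that it runs without a download; the step names are illustrative only.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented stand-in data (not the real Titanic rows).
toy = pd.DataFrame({
    "name": ["Allen, Miss Elisabeth", "Braund, Mr Owen",
             "Cumings, Mrs John", "Heikkinen, Miss Laina"],
    "home.dest": ["St Louis, MO", "Montreal, PQ", "New York, NY", "London"],
    "age": [29.0, np.nan, 38.0, 26.0],
    "fare": [211.3, 7.25, 71.28, np.nan],
})

numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())])

# Text columns are given as single strings, numeric columns as a list.
preprocessor = ColumnTransformer(transformers=[
    ("name_vectorizer", TfidfVectorizer(), "name"),
    ("dest_vectorizer", TfidfVectorizer(), "home.dest"),
    ("num", numeric, ["age", "fare"]),
])
preprocessor.fit(toy)

# Serialize the fitted object to bytes and load it back.
restored = pickle.loads(pickle.dumps(preprocessor))


def to_dense(matrix):
    """Return a dense array whether the output is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


original_out = to_dense(preprocessor.transform(toy))
restored_out = to_dense(restored.transform(toy))
same_output = np.allclose(original_out, restored_out)
print(same_output)
```

Writing to disk instead (for example with `pickle.dump` or `joblib.dump`) should work the same way, with the usual caveat that a pickled estimator is best loaded with the same scikit-learn version that saved it.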
The difficulty here is that (most?) text transformers expect 1-dimensional input, but (most?) numerical transformers expect 2-dimensional input. ColumnTransformer handles that by letting you specify either a single column or a list of columns: in the first case a 1d array is passed on to the transformer, and in the second a 2d array is passed.
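The single-column versus list distinction can be checked directly. The sketch below uses a hypothetical helper transformer, `NdimRecorder` (not part of scikit-learn), whose only job is to record the dimensionality of the input it is handed; the DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class NdimRecorder(TransformerMixin, BaseEstimator):
    """Hypothetical probe: remembers how many dimensions its input had."""

    def fit(self, X, y=None):
        self.seen_ndim_ = X.ndim
        return self

    def transform(self, X):
        # Return a dummy 2d column so ColumnTransformer can stack outputs.
        return np.zeros((X.shape[0], 1))


toy = pd.DataFrame({"name": ["a b", "c d", "e f"], "age": [1.0, 2.0, 3.0]})

probe = ColumnTransformer(transformers=[
    ("scalar_spec", NdimRecorder(), "name"),   # single column name
    ("list_spec", NdimRecorder(), ["age"]),    # list with one column name
])
probe.fit(toy)

scalar_ndim = probe.named_transformers_["scalar_spec"].seen_ndim_
list_ndim = probe.named_transformers_["list_spec"].seen_ndim_
print(scalar_ndim, list_ndim)  # 1 2
```

Note that `["age"]` still counts as a list, so even a single numerical column reaches its transformer as 2d data when written that way.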
So, to explain the errors in the three attempts:
Attempt 1: The text columns are given as a list, so the TF-IDF vectorizer receives 2d input. It treats the columns, not the individual entries, as the documents, and so produces just three rows of output. Concatenating that with the 1047-row numerical output then fails.
Attempt 2: FeatureUnion
doesn’t have the same input format as ColumnTransformer
: you shouldn’t have triples (name, transformer, columns)
in this case. Anyway, FeatureUnion
isn’t meant for what you’re doing here.
Attempt 3: This time each numerical column is specified as a single string, so you’re sending 1d data through to the numerical transformers, which expect 2d data.
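The failure modes behind Attempts 1 and 3 can be reproduced outside ColumnTransformer. This is a small sketch on an invented DataFrame (`toy`), calling the transformers directly with 1d and 2d selections:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer

# Invented stand-in data (not the real Titanic rows).
toy = pd.DataFrame({
    "name": ["Allen, Miss Elisabeth", "Braund, Mr Owen",
             "Cumings, Mrs John", "Heikkinen, Miss Laina"],
    "cabin": ["B5", "", "C85", ""],
    "age": [29.0, np.nan, 38.0, 26.0],
})

# Attempt 1: a 1d selection gives one output row per entry ...
rows_from_1d = TfidfVectorizer().fit_transform(toy["name"]).shape[0]
# ... but iterating over a 2d DataFrame yields its column labels,
# so the vectorizer sees one "document" per column.
rows_from_2d = TfidfVectorizer().fit_transform(toy[["name", "cabin"]]).shape[0]
print(rows_from_1d, rows_from_2d)  # 4 2

# Attempt 3: a numerical transformer rejects a 1d selection.
try:
    SimpleImputer(strategy="median").fit(toy["age"])
    raised_on_1d = False
except ValueError:
    raised_on_1d = True
print(raised_on_1d)  # True

# The same column selected as a one-element list is accepted.
SimpleImputer(strategy="median").fit(toy[["age"]])
```

This mirrors the rule of thumb in the short answer: a single column name for each text column, a list of column names for the numerical ones.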