How is it that the accuracy score for 10-fold cross-validation is worse than for a 90-10 train_test_split using sklearn?

The task is binary classification via a neural network. The data is stored in a dictionary whose keys are the composite names of the entries and whose values are vectors; the label (0 or 1) is the third element of each value vector. The first and second elements are the two parts of the composite name, which are used later to extract the corresponding features.

In both cases, the dictionary is first transformed into two arrays in order to perform a balanced undersampling of the majority class (which makes up 66% of the data):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Keys (composite names) and labels (third element of each value vector)
data_for_sampling = np.asarray(list(data.keys()))
labels_for_sampling = [element[2] for element in data.values()]

# Undersample the majority class so that both labels are equally represented
sampler = RandomUnderSampler(sampling_strategy = 'majority')
data_sampled, label_sampled = sampler.fit_resample(data_for_sampling.reshape(-1, 1), labels_for_sampling)

Then the resampled arrays of names and labels are used to create train and test sets, either via the KFold method:

from sklearn.model_selection import KFold, train_test_split

kfolder = KFold(n_splits = 10, shuffle = True)

for train_index, test_index in kfolder.split(data_sampled):
    # Labels are re-extracted later from the original dictionary, using the names as keys
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]

Or the train_test_split method:

data_train, data_test, label_train, label_test = train_test_split(data_sampled, label_sampled, test_size = 0.1, shuffle = True)

Finally, the names from data_train and data_test are used to re-extract the relevant entries (by key) from the original dictionary, which is then used to gather the features of those entries.

As far as I understand, a single split of the 10-fold sets should provide a similar train-test data distribution to the 90-10 train_test_split, and that seems to be true during training, where both training sets result in ~0.82 accuracy after only one epoch, run separately with model.fit(). However, when the corresponding models are evaluated using model.evaluate() on the test sets after said epoch, the set from train_test_split gives a score of ~0.86, while the set from KFold gives ~0.72.

I have done numerous tests to rule out a bad random seed (the seed is not fixed), but the results stayed the same. The sets also have correctly balanced label distributions and seemingly well-shuffled entries.
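
For context, here is a minimal sketch of how one split is turned into features and evaluated. extract_features() and build_model() are placeholder names for steps not shown here, so treat this only as an illustration:

import numpy as np

def make_xy(keys, data):
    # Re-extract each entry from the original dictionary by key; the first two
    # elements are the name parts (assumed inputs to the hypothetical extract_features),
    # the third element is the label
    X = np.array([extract_features(data[key][0], data[key][1]) for key in keys])
    y = np.array([data[key][2] for key in keys])
    return X, y

# data_train / data_test have shape (n, 1) after the earlier reshape, hence .ravel()
X_train, y_train = make_xy(data_train.ravel(), data)
X_test, y_test = make_xy(data_test.ravel(), data)

model = build_model()                     # hypothetical: returns a compiled Keras binary classifier
model.fit(X_train, y_train, epochs = 1)
model.evaluate(X_test, y_test)            # batch_size defaults to 32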

Answer

As it turns out, the problem originates from a combination of factors:

While shuffle = True in train_test_split() shuffles the provided data first and then splits it into the desired parts, shuffle = True in KFold only randomizes which samples end up in each fold; the data within each fold stays in its original order.

The documentation does point this out, which I found thanks to this issue: https://github.com/scikit-learn/scikit-learn/issues/16068
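
A quick standalone check (not part of the original code) makes the difference visible:

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20).reshape(-1, 1)

# KFold with shuffle = True: which rows land in each fold is random, but the
# indices returned for a fold (and thus its rows) are in ascending order
_, test_idx = next(KFold(n_splits = 10, shuffle = True).split(X))
print(np.all(np.diff(test_idx) > 0))    # True: the test fold is sorted

# train_test_split with shuffle = True: the returned rows themselves are shuffled
X_train, X_test = train_test_split(X, test_size = 0.5, shuffle = True)
print(X_test.ravel())                   # almost certainly not in ascending order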

Now, during learning, my custom train function applies shuffling to the train data again, just to be sure, but it does not shuffle the test data. Moreover, model.evaluate() defaults to batch_size = 32 if no parameter is given, which, paired with the ordered test data, resulted in the discrepancy in the validation accuracy. The test data is indeed flawed in the sense that it contains a large portion of "hard-to-predict" entries, which were clustered together because of the ordering and apparently dragged down the average accuracy in the results.

A completed run across all N folds, as pointed out by TC Arlen, may indeed have given a more precise estimate in the end, but I expected closer results after only one fold, which led to the discovery of this problem.
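
One possible remedy (a sketch, not necessarily the exact fix) is to permute the indices of each fold before gathering the data, so that neither the train nor the test set follows the original ordering:

import numpy as np

rng = np.random.default_rng()

for train_index, test_index in kfolder.split(data_sampled):
    # Break up the within-fold ordering that KFold preserves
    train_index = rng.permutation(train_index)
    test_index = rng.permutation(test_index)
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]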