How to fix randomization in sklearn

I am trying to fix the randomization in my code, but every time I run it I get a different best score and different best parameters. The results are not too far apart, but how can I fix them so that I get the same best score and parameters on every run?

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 27)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

clf = DecisionTreeClassifier(random_state=None)

parameter_grid = {'criterion': ['gini', 'entropy'],
                  'splitter': ['best', 'random'],
                  'max_depth': [1, 2, 3, 4, 5, 6, 8, 10, 20, 30, 50],
                  'max_features': [10, 20, 30, 40, 50]}

skf = StratifiedKFold(n_splits=10, random_state=None)
skf.get_n_splits(X_train, y_train)

grid_search = GridSearchCV(clf, param_grid=parameter_grid, cv=skf, scoring='precision')
grid_search.fit(X_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_

y_pred_iris = clf.predict(X_test)


In order to get reproducible results, every source of randomness in your code must be explicitly seeded (and even then, you must be careful that the implicit assumption of all else being equal actually holds – see Why does the importance parameter influence performance of Random Forest in R? for a case where it does not).

There are three parts in your code that inherently include a random element:

  • train_test_split
  • DecisionTreeClassifier
  • StratifiedKFold

You correctly seed the first one (using random_state=27), but you fail to do so for the other two, leaving random_state=None in both of them.

What you should do is simply replace the two cases of random_state=None in your code with an explicit seed, as you have done for train_test_split; it doesn’t have to be any specific number, or even the same for all cases, it just needs to be explicitly set.
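As a minimal sketch of the fully seeded pipeline (using the Iris dataset and a reduced parameter grid for brevity – the seed values 27, 0, and 42 are arbitrary, and any fixed integers work), note one scikit-learn detail: in StratifiedKFold a random_state only takes effect when shuffle=True, so seeding it also requires enabling shuffling:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def run_search(X, y):
    # Seed all three sources of randomness: the split, the estimator, the CV folds.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=27)              # seeded split
    clf = DecisionTreeClassifier(random_state=0)            # seeded estimator
    skf = StratifiedKFold(n_splits=5, shuffle=True,
                          random_state=42)                  # seeded CV folds
    grid = GridSearchCV(clf, param_grid={'max_depth': [2, 3, 4]},
                        cv=skf, scoring='accuracy')
    grid.fit(X_train, y_train)
    return grid.best_score_, grid.best_params_

X, y = load_iris(return_X_y=True)
# With every random_state fixed, two independent runs give identical results.
assert run_search(X, y) == run_search(X, y)
```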
