keras – image and label don’t match in ImageDataGenerator.flow_from_directory

I want to classify images into about 2,000 classes, so I used ImageDataGenerator.flow_from_directory.

I made a main directory (test1) with 2,000 subdirectories, one per class.

[Screenshot: contents of the main directory, test1]

[Screenshot: contents of one subdirectory]

Each subdirectory has 20 images (about 40k images in total).

I checked the generator with this script:

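import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import ImageDataGenerator

trainDataGen = ImageDataGenerator(rescale=1./255)
trainGenSet = trainDataGen.flow_from_directory(
    './test1',
    batch_size=8,
    target_size=(64, 64),
    class_mode='categorical',
    color_mode='grayscale'
)

# Fetch one batch: a[0] holds the images, a[1] the one-hot labels
a = next(trainGenSet)

# Show the first image of the batch (dropping the single grayscale channel)
# and print the index of its label
plt.imshow(a[0][0].squeeze(), cmap='gray')
print(np.argmax(a[1][0]))
plt.show()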

Then I saw that the image doesn't match the label.

For example, a[0][0] showed an image from class 300, but a[1][0] gave label 1948!

However, the generator worked fine with 10 classes. I tested it on 10 classes (images of the digits 0 to 9) using the same script, and the images and labels matched.

Why does the generator produce correct pairs for 10 or fewer classes, but not for more than 10?

Answer

In flow_from_directory you did not specify the shuffle parameter, so it defaults to True. Try setting it to False so that batches come back in a fixed order, which makes the image/label pairs easier to check.

Also remember that the generator reads class directories in alphanumeric (string) order, not numeric order. For example, if your list of classes is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], the class directories are fetched in the order 0, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 3, 4, 5, 6, 7, 8, 9, and label indices are assigned in that order. That is why the order is not what you expect once you have more than 10 subdirectories: here the directory named 2 gets label 12, not 2.

You can avoid this by zero-padding your subdirectory names to a fixed width, like 0000, 0001, 0002, 0003, ..., 1999. Remember that files are also fetched in alphanumeric order.
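To make the ordering concrete, here is a minimal sketch in plain Python that reproduces the string sort described above and shows the effect of zero-padding. The directory names "0" to "20" and the helper class_name_for are made up for illustration. The dictionary label_of only imitates the name-to-index mapping, it is not read from Keras.

```python
# Hypothetical class directory names "0" .. "20", as in the example above.
names = [str(i) for i in range(21)]

# Alphanumeric (string) sort: "10" comes before "2".
string_order = sorted(names)
label_of = {name: idx for idx, name in enumerate(string_order)}
print(string_order)
print(label_of["2"], label_of["10"])  # 12 2

# Zero-padding to a fixed width makes string order equal numeric order.
padded = [name.zfill(4) for name in names]
padded_label_of = {name: idx for idx, name in enumerate(sorted(padded))}
print(all(padded_label_of[str(i).zfill(4)] == i for i in range(21)))  # True


def class_name_for(one_hot, class_indices):
    """Return the directory name whose label index is the argmax of one_hot.

    class_indices is a dict mapping directory name -> label index.
    """
    index_to_name = {idx: name for name, idx in class_indices.items()}
    best = max(range(len(one_hot)), key=lambda i: one_hot[i])
    return index_to_name[best]


# A one-hot label with a 1 at position 12 belongs to the directory named "2".
one_hot = [0.0] * 21
one_hot[12] = 1.0
print(class_name_for(one_hot, label_of))  # 2
```

With the real generator, the mapping it actually uses is available as trainGenSet.class_indices, so passing that dictionary to a helper like this translates np.argmax(a[1][0]) back to a directory name without renaming any folders.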