How can I measure the semantic similarity between image features extracted by pre-trained models (e.g. VGG, ResNet…)?

As far as I know, pre-trained models perform well as feature extractors in many tasks, thanks to the large datasets they were trained on.

However, I’m wondering whether such a model, let’s say VGG-16, has the ability to extract some “semantic” information from an input image.

If the answer is positive, then given an unlabeled dataset, is it possible to “cluster” images by measuring the semantic similarity of the extracted features?

Here is what I’ve tried so far:

  1. Load pre-trained VGG-16 through PyTorch.
  2. Load the CIFAR-10 dataset and transform it into a batched tensor X of size (5000, 3, 224, 224).
  3. Fine-tune vgg.classifier, keeping its output dimension at 4096.
  4. Extract features:
 features = vgg.features(X).view(X.shape[0], -1) # X: (5000, 3, 224, 224) -> features: (5000, 25088)

 features = vgg.classifier(features) # features: (5000, 4096)

 return features
  5. Try cosine similarity, inner product, and torch.cdist, only to end up with several bad clusters.

Any suggestion? Thanks in advance.


You might not want to go all the way to the last layer, since the final layers contain features specific to the classification task at hand. Using features from earlier layers of the classifier might help. Additionally, you want to switch to eval mode, since VGG-16 has a dropout layer in its classifier.

>>> vgg16 = torchvision.models.vgg16(pretrained=True).eval()

Truncate the classifier:

>>> vgg16.classifier = vgg16.classifier[:4]

Now vgg16's classifier will look like:

(classifier): Sequential(
  (0): Linear(in_features=25088, out_features=4096, bias=True)
  (1): ReLU(inplace=True)
  (2): Dropout(p=0.5, inplace=False)
  (3): Linear(in_features=4096, out_features=4096, bias=True)
)

Then extract the features:

>>> vgg16(torch.rand(1, 3, 124, 124)).shape
torch.Size([1, 4096])
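With features from the truncated model, one way to cluster is k-means on L2-normalized vectors, so that Euclidean k-means approximates cosine-similarity grouping. A minimal sketch (random features stand in for the real (N, 4096) VGG outputs, and the choice of 10 clusters matches CIFAR-10's class count):

```python
import torch
from sklearn.cluster import KMeans

# Stand-in for the (N, 4096) features from the truncated VGG-16.
features = torch.rand(200, 4096)

# L2-normalize rows so Euclidean distance tracks cosine similarity.
normed = torch.nn.functional.normalize(features, dim=1).numpy()

# Cluster into 10 groups (one per CIFAR-10 class, ideally).
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(normed)
```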