As far as I know, pre-trained models work well as feature extractors in many tasks, thanks to the large datasets they were trained on.
However, I'm wondering whether such a model, let's say VGG-16, has some ability to extract "semantic" information from an input image.
If the answer is positive, given an unlabeled set of images,
is it possible to "cluster" them by measuring the semantic similarity of the extracted features?
Here is what I have tried so far:
- Load a pre-trained VGG-16 through PyTorch.
- Load the CIFAR-10 dataset and transform it into a batched tensor X of size (5000, 3, 224, 224).
- Modify vgg.classifier so that its output dimension is 4096.
- Extract features:

```python
features = vgg.features(X).view(X.shape[0], -1)  # X: (5000, 3, 224, 224)
                                                 # features: (5000, 25088)
features = vgg.classifier(features)              # features: (5000, 4096)
return features
```
- Try out torch.cdist on the extracted features, only to find several bad clusters.
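For reference, the distance-based grouping described above can be sketched as follows. This is a hypothetical reconstruction of the attempt, with random tensors standing in for the real VGG-16 features (shape `(N, 4096)`):

```python
import torch

# Stand-in for the extracted VGG-16 features (real shape: (5000, 4096))
features = torch.rand(100, 4096)

# Pairwise Euclidean distances between all feature vectors: (100, 100)
dists = torch.cdist(features, features)

# Mask out self-distances, then find each image's nearest neighbour
dists.fill_diagonal_(float("inf"))
nearest = dists.argmin(dim=1)  # index of the most similar image, per image
```

Images whose nearest neighbours share a class would indicate that the features carry semantic information.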
Any suggestion? Thanks in advance.
You might not want to go all the way to the last layer, since it contains features specific to the original classification task. Using the output of an earlier layer of the classifier might help. Additionally, you should switch to eval mode, since VGG-16 has a dropout layer in its classifier.
>>> vgg16 = torchvision.models.vgg16(pretrained=True).eval()
Truncate the classifier:
>>> vgg16.classifier = vgg16.classifier[:4]
vgg16's classifier will then look like:
(classifier): Sequential(
  (0): Linear(in_features=25088, out_features=4096, bias=True)
  (1): ReLU(inplace=True)
  (2): Dropout(p=0.5, inplace=False)
  (3): Linear(in_features=4096, out_features=4096, bias=True)
)
Then extract the features:
>>> vgg16(torch.rand(1, 3, 124, 124)).shape
torch.Size([1, 4096])
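Once you have these 4096-dimensional features, one way to group them is a simple k-means loop. This is a minimal sketch, not a tuned implementation: the `kmeans` helper below is hypothetical, and L2-normalising the features first (so Euclidean distance tracks cosine similarity) is one assumption that often helps with deep features:

```python
import torch

def kmeans(feats, k=10, iters=20):
    # Minimal k-means sketch over feature vectors feats of shape (N, D).
    # L2-normalise so Euclidean distance behaves like cosine distance.
    feats = torch.nn.functional.normalize(feats, dim=1)
    # Initialise centroids with k randomly chosen feature vectors
    centroids = feats[torch.randperm(feats.shape[0])[:k]]
    for _ in range(iters):
        # Assign each vector to its nearest centroid
        labels = torch.cdist(feats, centroids).argmin(dim=1)
        # Recompute each centroid as the mean of its assigned vectors
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = feats[mask].mean(dim=0)
    return labels

# e.g. feats = vgg16(X) with the truncated classifier above;
# random features stand in here for illustration
labels = kmeans(torch.rand(500, 4096), k=10)
```

With real features you could compare the resulting `labels` against the CIFAR-10 ground truth to gauge how semantic the features are.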