I am trying to align two embeddings (textual and visual) in a shared space.
For the visual embedding, I am using VGG as the encoder; its output is a
1x1000 embedding. For the textual encoder, I am using a Transformer whose output is shaped
1x712. I want to project both vectors to the same dimension, 512.
img_features.shape, txt_features.shape = (1,1000),(1,712)
How can I do this in PyTorch? Should I add a final layer to each architecture that maps the output to 512?
One option is to apply a differentiable PCA operator.
Alternatively, an easier solution is to use two fully connected adapter layers to learn two mappings: one for your image features, `1000 -> n`, and the other for your textual features, `712 -> n`. Then choose a fusion strategy to combine the two features of shape `(1, n)`: either point-wise addition/multiplication (in which case `n` should be equal to `512`), or concatenation followed by a final learned mapping `n*2 -> 512`.
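A minimal sketch of the adapter-plus-fusion idea (the module and argument names here are illustrative, not from any existing library):

```python
import torch
import torch.nn as nn

class FusionAdapter(nn.Module):
    """Hypothetical adapter: projects image and text features to a
    common dimension n, then fuses them into a 512-d vector."""

    def __init__(self, img_dim=1000, txt_dim=712, n=512, out_dim=512, fusion="add"):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, n)   # 1000 -> n
        self.txt_proj = nn.Linear(txt_dim, n)   # 712  -> n
        self.fusion = fusion
        if fusion == "concat":
            # concatenation doubles the width, so learn n*2 -> 512
            self.head = nn.Linear(2 * n, out_dim)

    def forward(self, img_features, txt_features):
        img = self.img_proj(img_features)
        txt = self.txt_proj(txt_features)
        if self.fusion == "add":
            # point-wise addition: requires n == out_dim (512 here)
            return img + txt
        return self.head(torch.cat([img, txt], dim=-1))

adapter = FusionAdapter(fusion="concat")
img_features, txt_features = torch.randn(1, 1000), torch.randn(1, 712)
fused = adapter(img_features, txt_features)
print(fused.shape)  # torch.Size([1, 512])
```

Both fusion modes end at `(1, 512)`; the concatenation variant just pays for one extra linear layer, while the point-wise variants constrain `n` to 512 up front.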