This loads both the text and vision encoders using pre-trained weights. The projection layers are randomly initialized, with one exception: if you use CLIP to initialize the vision model, the vision projection weights are also loaded from the pre-trained checkpoint.
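As a minimal sketch of this behavior (the checkpoint names are illustrative), `from_vision_text_pretrained` accepts any supported vision checkpoint, and only a CLIP-based vision tower brings its projection weights along:

```python
from transformers import VisionTextDualEncoderModel

# Vision tower initialized from CLIP: the vision projection layer is loaded
# from the pre-trained CLIP checkpoint along with the encoder weights.
model_clip = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32", "roberta-base"
)

# Vision tower initialized from a plain ViT checkpoint: the encoder weights
# are pre-trained, but both projection layers start from random values.
model_vit = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "roberta-base"
)
```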
## Prepare the dataset
We will use the MS-COCO dataset to train our dual encoder model. MS-COCO contains over 82,000 images, each of which has at least 5 different caption annotations. The dataset is usually used for image captioning tasks, but we can repurpose the image-caption pairs to train our dual encoder model for image search.
...
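For orientation, here is a sketch of the kind of preparation this step performs, assuming the standard COCO caption annotation format; the annotation path and the `image_path`/`caption` record fields are assumptions, so adapt them to your local layout:

```python
import collections
import json

# Read the standard COCO caption annotations (the path is an assumption;
# adjust it to wherever you extracted the dataset).
with open("coco_dataset/annotations/captions_train2017.json") as f:
    annotations = json.load(f)["annotations"]

# Group captions by image so every image keeps all of its annotations.
image_to_captions = collections.defaultdict(list)
for ann in annotations:
    image_path = "coco_dataset/train2017/%012d.jpg" % ann["image_id"]
    image_to_captions[image_path].append(ann["caption"].strip())

# Flatten into one record per image-caption pair and hold out a slice for
# validation. The field names here are illustrative.
pairs = [
    {"image_path": path, "caption": caption}
    for path, captions in image_to_captions.items()
    for caption in captions
]
split = int(0.9 * len(pairs))
with open("coco_dataset/train_dataset.json", "w") as f:
    for record in pairs[:split]:
        f.write(json.dumps(record) + "\n")
with open("coco_dataset/valid_dataset.json", "w") as f:
    for record in pairs[split:]:
        f.write(json.dumps(record) + "\n")
```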
Next we can run the example script to train the model:
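A sketch of that invocation, assuming the `run_clip.py` script from the Transformers contrastive image-text example and a `./clip-roberta` directory holding the initialized model and processor; check the script's `--help` for the exact arguments before running:

```bash
python examples/pytorch/contrastive-image-text/run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --train_file coco_dataset/train_dataset.json \
    --validation_file coco_dataset/valid_dataset.json \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --learning_rate 5e-5 \
    --overwrite_output_dir
```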