Multilingual-CLIP
OpenAI CLIP text encoders for any language.
OpenAI recently released the paper Learning Transferable Visual Models From Natural Language Supervision, in which they present the CLIP (Contrastive Language–Image Pre-training) model. This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models, a visual encoder and a text encoder. These were trained on a whopping 400 million images and corresponding captions. OpenAI has since released a set of their smaller CLIP models, which can be found on the official CLIP GitHub.
We propose a fine-tuning method that replaces the original English text encoder with a pre-trained text model in any language. This method makes it possible to adapt the powerful CLIP model to any language in roughly 24 GPU hours.
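To make the contrastive objective mentioned above more concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of image and caption embeddings. It is an illustration only, not code from this repository or from OpenAI: the function names, the random stand-in embeddings, and the temperature of 0.07 are all assumptions chosen for the example.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # temperature=0.07 is an illustrative assumption, not a value from this repo.
    # L2-normalise so that the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] = scaled similarity between image i and caption j.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(len(logits))
    # Matching pairs sit on the diagonal; score them in both directions.
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

# Random stand-ins for a batch of 4 image embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))

# Captions that match their images give a lower loss than mismatched ones.
aligned_loss = contrastive_loss(image_emb, image_emb.copy())
shuffled_loss = contrastive_loss(image_emb, image_emb[[1, 2, 3, 0]])
print(aligned_loss < shuffled_loss)
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the property the training objective rewards.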
This repository contains:
- PyTorch inference code
- TensorFlow training code
- Pre-trained CLIP-Text encoders for multiple languages
- Training data and pre-computed CLIP text encodings for a large portion of the image captions of GCC + MSCOCO + VizWiz
Requirements
While other versions may work equally well, we have worked with the following:
- Python = 3.6.9
- Transformers = 4.1.1
- Model Weights
Usage
Download CLIP Model
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine, or with cpuonly when installing on a machine without a GPU.
For more information, please see the official CLIP repository.
Download Linear Weights
# Linear Model Weights
$ bash get-weights.sh
Inference
from src import multilingual_clip
print(multilingual_clip.AVAILABLE_MODELS.keys())
model = multilingual_clip.load_model('M-BERT-Distil-40')
embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
For a more elaborate example that compares the textual embeddings to the CLIP image embeddings, see this colab notebook.
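As a rough idea of what such a comparison involves, the sketch below ranks images for each sentence by cosine similarity. It is a minimal illustration, not the notebook's code: the arrays are random placeholders with the 640-dimensional shape shown above, and `cosine_similarity_matrix` is a hypothetical helper. With real data you would first convert the text embeddings and the CLIP image embeddings to NumPy arrays.

```python
import numpy as np

def cosine_similarity_matrix(text_emb, image_emb):
    # Normalise each row to unit length, then take all pairwise dot products.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return text_emb @ image_emb.T

# Random placeholders standing in for 3 sentence and 5 image embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 640))
image_emb = rng.normal(size=(5, 640))

# similarity[i, j] = cosine similarity between sentence i and image j.
similarity = cosine_similarity_matrix(text_emb, image_emb)

# Index of the best-matching image for each sentence.
best_image_per_text = similarity.argmax(axis=1)
print(similarity.shape)
```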
Pre-trained Models
Every text encoder is a transformer available on Hugging Face, with an additional linear layer on top. None of the models have been extensively tested, but for more information and qualitative test results for a specific model, click the model name to see its model card.
*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the Linear Weights. ***
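The "transformer plus linear layer" structure can be sketched in a few lines of NumPy. This is a hedged illustration rather than the repository's implementation: the mask-aware mean pooling, the hidden size of 768, the random weights, and the name `project_to_clip_space` are all assumptions made for the example. Only the 640-dimensional output size is taken from the inference example above.

```python
import numpy as np

def project_to_clip_space(token_states, attention_mask, weight, bias):
    # token_states: (batch, tokens, hidden) outputs of the transformer.
    # attention_mask: (batch, tokens), 1 for real tokens and 0 for padding.
    mask = attention_mask[:, :, None].astype(token_states.dtype)
    # Assumed pooling: average the hidden states over non-padding tokens.
    pooled = (token_states * mask).sum(axis=1) / mask.sum(axis=1)
    # The additional linear layer maps into the CLIP embedding space.
    return pooled @ weight + bias

rng = np.random.default_rng(0)
hidden_size, clip_size = 768, 640  # hidden_size is an illustrative assumption

# Random stand-ins for transformer outputs of 2 sentences with 6 token slots.
token_states = rng.normal(size=(2, 6, hidden_size))
attention_mask = np.array([[1, 1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1, 1]])
weight = rng.normal(size=(hidden_size, clip_size)) * 0.02
bias = np.zeros(clip_size)

embeddings = project_to_clip_space(token_states, attention_mask, weight, bias)
print(embeddings.shape)  # (2, 640)
```

Masking before pooling means padded positions do not influence the sentence embedding, so sentences of different lengths can share a batch.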
Name | Model Base | Vision Model | Pre-trained Languages | Target Languages | #Parameters |
---|---|---|---|---|---|
Multilingual | |||||
M-BERT Distil 40 | M-BERT Distil | RN50x4 | 101 Languages | 40 Languages | 66 M |
M-BERT Base 69 | M-BERT Base | RN50x4 | 101 Languages | 68 Languages | 110 M |
Monolingual | |||||
Swe-CLIP 500k | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
Swe-CLIP 2M | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
Training a new model
This folder contains the code used for training the above models. If you wish to train your own model, you must do the following:
- Prepare a set of translated sentence pairs from English -> Your Language(s)
- Compute regular CLIP-Text embeddings for the English sentences.
- Edit Training.py to load your data.
- Train a new CLIP-Text encoder via Teacher Learning
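The idea behind the steps above can be sketched as follows: the CLIP text embedding of an English sentence acts as the teacher target, and the new encoder is trained so that its output for the translated sentence lands on the same vector. The toy example below is an assumption-laden illustration, not the code in Training.py: the student is reduced to a single linear map, the data is random, and mean squared error is assumed as the training loss.

```python
import numpy as np

rng = np.random.default_rng(0)
num_sentences, student_dim, clip_dim = 64, 16, 8  # toy sizes for illustration

# Stand-in for the student's features of the translated sentences.
student_features = rng.normal(size=(num_sentences, student_dim))
# Stand-in for pre-computed CLIP text embeddings of the English originals.
teacher_embeddings = rng.normal(size=(num_sentences, clip_dim))

# The trainable part of this toy student is one linear map.
weight = np.zeros((student_dim, clip_dim))

def mse(prediction, target):
    # Mean squared error over all sentences and dimensions (assumed loss).
    return float(np.mean((prediction - target) ** 2))

initial_loss = mse(student_features @ weight, teacher_embeddings)

learning_rate = 0.5
for _ in range(200):
    error = student_features @ weight - teacher_embeddings
    # Gradient of the mean squared error with respect to the weight matrix.
    gradient = 2.0 * student_features.T @ error / error.size
    weight -= learning_rate * gradient

final_loss = mse(student_features @ weight, teacher_embeddings)
print(final_loss < initial_loss)
```

Because the teacher embeddings are fixed and can be computed once, training needs only sentence pairs and no images.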
Pre-computed CLIP Embeddings & Translation Data
[This Google Drive folder](https://drive.google.com/drive/folders/1I9a7naSZubUATWzLFv61DQMWyFlF7wR5?usp=sharing) contains pre-computed CLIP-Text embeddings for a large portion of the image captions of GCC + MSCOCO + VizWiz.
The Google Drive folder also contains the translation data used to train the currently available models.
Good Luck