OpenAI CLIP text encoders for any language.
OpenAI recently released the paper Learning Transferable Visual Models From Natural Language Supervision, in which they present the CLIP (Contrastive Language–Image Pre-training) model. This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models, a visual encoder and a text encoder. These were trained on a whopping 400 million images and corresponding captions. OpenAI has since released a set of their smaller CLIP models, which can be found on the official CLIP GitHub.
We propose a fine-tuning method to replace the original English text encoder with a pre-trained text model in any language. This makes it possible to adapt the powerful CLIP model to any language in roughly 24 GPU hours.
This repository contains
- PyTorch inference code
- TensorFlow training code
- Pre-trained CLIP-Text encoders for multiple languages
- Training data and pre-computed CLIP text encodings for a large portion of the image captions of GCC + MSCOCO + VizWiz
While it is possible that other versions work equally well, we have worked with the following:
- Python = 3.6.9
- Transformers = 4.1.1
Model Weights
Download CLIP Model
```bash
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
```
Replace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine, or `cpuonly` when installing on a machine without a GPU.
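For example, a CPU-only install would look like this (same package versions as above, with `cudatoolkit` swapped for `cpuonly`):

```bash
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cpuonly
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
```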
For more information please see the official CLIP repository.
Download Linear Weights
```bash
# Linear Model Weights
$ bash get-weights.sh
```
```python
from src import multilingual_clip

print(multilingual_clip.AVAILABLE_MODELS.keys())

model = multilingual_clip.load_model('M-BERT-Distil-40')
embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
```
For a more elaborate example comparing the textual embeddings to the CLIP image embeddings, see this Colab notebook.
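Roughly, such a comparison boils down to encoding an image with the original CLIP vision model and the captions with the multilingual text encoder, then scoring the pairs by cosine similarity. The following is a minimal sketch of that idea (the image path and captions are placeholders, and the notebook may differ in details):

```python
import clip
import torch
from PIL import Image
from src import multilingual_clip

# Image features from the original CLIP vision model (RN50x4 pairs with the 640-dim text encoders).
clip_model, preprocess = clip.load('RN50x4', device='cpu')
image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    image_features = clip_model.encode_image(image)

# Text features from the multilingual text encoder (also 640-dim).
text_model = multilingual_clip.load_model('M-BERT-Distil-40')
text_features = text_model(['En hund som leker i snön',   # Swedish: "A dog playing in the snow"
                            'Eine Katze auf dem Sofa'])    # German: "A cat on the sofa"

# Cosine similarity between the image and each caption.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print(image_features @ text_features.T)  # shape: (1, 2)
```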
Every text encoder is a transformer available through Huggingface, with an additional linear layer on top. None of the models have been extensively tested, but for more information and qualitative test results for a specific model, click the Model Name to see its model card.
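Conceptually, each text encoder therefore looks something like the sketch below. This is not the repository's actual implementation; the base model name, pooling strategy, and 640-dimensional output (which matches the RN50x4 vision model) are assumptions made for illustration.

```python
import torch
import transformers

class MultilingualClipTextSketch(torch.nn.Module):
    """A Huggingface transformer with a linear layer mapping into CLIP's embedding space."""

    def __init__(self, model_name='distilbert-base-multilingual-cased', clip_dim=640):
        super().__init__()
        self.transformer = transformers.AutoModel.from_pretrained(model_name)
        self.projection = torch.nn.Linear(self.transformer.config.hidden_size, clip_dim)

    def forward(self, input_ids, attention_mask):
        hidden = self.transformer(input_ids=input_ids, attention_mask=attention_mask)[0]
        # Mean-pool the token embeddings over the attention mask (pooling strategy assumed),
        # then project into the CLIP embedding space.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.projection(pooled)
```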
*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the Linear Weights. ***
| Name | Model Base | Vision Model | Pre-trained Languages | Target Languages | #Parameters |
|---|---|---|---|---|---|
| M-BERT Distil 40 | M-BERT Distil | RN50x4 | 101 Languages | 40 Languages | 66 M |
| M-BERT Base 69 | M-BERT Base | RN50x4 | 101 Languages | 68 Languages | 110 M |
| Swe-CLIP 500k | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
| Swe-CLIP 2M | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
Training a new model
This folder contains the code used for training the above models. If you wish to train your own model, you must do the following:
- Prepare a set of translated sentence pairs from English -> Your Language(s)
- Compute regular CLIP-Text embeddings for the English sentences.
- Edit Training.py to load your data.
- Train a new CLIP-Text encoder via Teacher Learning (a minimal sketch of this step follows the list)
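For illustration, a minimal PyTorch sketch of the Teacher Learning step could look like the following. Note that the repository's actual training code is written in TensorFlow (see Training.py), and the loss, tokenizer, and wiring below are assumptions made for this sketch.

```python
import torch
import transformers

def teacher_learning_step(student, tokenizer, optimizer, translated_sentences, clip_embeddings):
    """One training step: push the student's embeddings towards the pre-computed CLIP targets.

    translated_sentences: list of sentences in the target language(s).
    clip_embeddings:      tensor of shape (batch, 640) holding the CLIP-Text embeddings
                          of the corresponding English sentences (the frozen teacher).
    """
    batch = tokenizer(translated_sentences, padding=True, truncation=True, return_tensors='pt')
    student_embeddings = student(batch['input_ids'], batch['attention_mask'])

    # Mean-squared error between student and teacher embeddings (loss choice assumed).
    loss = torch.nn.functional.mse_loss(student_embeddings, clip_embeddings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring (names are placeholders):
# student   = MultilingualClipTextSketch('distilbert-base-multilingual-cased')
# tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# optimizer = torch.optim.Adam(student.parameters(), lr=1e-5)
```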
Pre-computed CLIP Embeddings & Translation Data
[This Google Drive folder](https://drive.google.com/drive/folders/1I9a7naSZubUATWzLFv61DQMWyFlF7wR5?usp=sharing) contains pre-computed CLIP-Text embeddings for a large portion of the image captions of GCC + MSCOCO + VizWiz.
The Google Drive folder also contains the translation data used to train the currently available models.