Feed forward VQGAN-CLIP model. The goal is to eliminate the need to optimize the VQGAN latent space separately for each input prompt. This is done by training a model that takes a text prompt as input and returns the corresponding VQGAN latent representation, which is then decoded into an RGB image. The model is trained on a dataset of text prompts and can be used on unseen text prompts. The loss function minimizes the distance between the CLIP features of the generated image and the CLIP features of the input text. Additionally, a diversity loss can be used to increase the diversity of the images generated for the same prompt.
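The objective described above can be illustrated with a small, framework-free sketch. It is not the repository's implementation: the use of cosine distance, the pairwise-similarity form of the diversity term, the function names, and the `diversity_weight` parameter are all assumptions made for illustration. Real training would compute these quantities on batched CLIP feature tensors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_alignment_loss(image_features, text_features):
    """Cosine distance: 0 when image and text features point the same way."""
    return 1.0 - cosine_similarity(image_features, text_features)


def diversity_loss(image_features_batch):
    """Mean pairwise cosine similarity among images generated for one prompt.

    Hypothetical form of a diversity term: minimizing it pushes the
    generated images apart in CLIP feature space.
    """
    n = len(image_features_batch)
    if n < 2:
        return 0.0
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(
        cosine_similarity(image_features_batch[i], image_features_batch[j])
        for i, j in pairs
    )
    return total / len(pairs)


def total_loss(image_features_batch, text_features, diversity_weight=0.1):
    """Average alignment loss plus a weighted diversity term (assumed weighting)."""
    alignment = sum(
        clip_alignment_loss(features, text_features)
        for features in image_features_batch
    ) / len(image_features_batch)
    return alignment + diversity_weight * diversity_loss(image_features_batch)
```

Setting `diversity_weight` to 0 in this sketch leaves only the image-text alignment term.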

How to install?

Download the 16384 Dimension Imagenet VQGAN (f=16)


Install dependencies.

With conda:

conda create -n ff_vqgan_clip_env python=3.8
conda activate ff_vqgan_clip_env
# Install pytorch/torchvision - See https://pytorch.org/get-started/locally/ for more info.
(ff_vqgan_clip_env) conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
(ff_vqgan_clip_env) pip install -r requirements.txt


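With pip/venv:

conda deactivate # Make sure to use your global python3
# venv is part of the Python 3 standard library, so no pip install is needed
# (on Debian/Ubuntu it may require the python3-venv system package)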
python3 -m venv ./ff_vqgan_clip_venv
source ./ff_vqgan_clip_venv/bin/activate
$ (ff_vqgan_clip_venv) python -m pip install -r requirements.txt

How to use?

(Optional) Pre-tokenize Text

$ (ff_vqgan_clip_venv) python main.py tokenize data/list_of_captions.txt cembeds 128


Modify configs/example.yaml as needed.

$ (ff_vqgan_clip_venv) python main.py train configs/example.yaml


The training loss is logged for TensorBoard.

# in a new terminal/session
(ff_vqgan_clip_venv) pip install tensorboard
(ff_vqgan_clip_venv) tensorboard --logdir results

Pre-trained models

| Name | Type | Size | Dataset | Link | Author |
|---|---|---|---|---|---|
| cc12m_8x128 | VitGAN | 12.1MB | Conceptual captions 12M | Download | @mehdidc |
| cc12m_16x256 | VitGAN | 60.1MB | Conceptual captions 12M | Download | @mehdidc |
| cc12m_32x512 | VitGAN | 408.4MB | Conceptual captions 12M | Download | @mehdidc |
| cc12m_32x1024 | VitGAN | 1.55GB | Conceptual captions 12M | Download | @mehdidc |
| cc12m_64x1024 | VitGAN | 3.05GB | Conceptual captions 12M | Download | @mehdidc |
| bcaptmod_8x128 | VitGAN | 11.2MB | Modified blog captions | Download | @afiaka87 |
| bcapt_16x128 | MLPMixer | 168.8MB | Blog captions | Download | @mehdidc |

You can also access them from here

NB: cc12m_AxB means a model trained on Conceptual Captions 12M, with depth A and hidden state dimension B. For example, cc12m_32x1024 has depth 32 and hidden state dimension 1024.
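If you need to pick a model programmatically, the naming convention can be parsed with a short helper. This is only a sketch: `ModelSpec` and `parse_model_name` are hypothetical names that are not part of this repository, and applying the same `prefix_AxB` pattern to the bcapt models is an assumption based on their names.

```python
import re
from typing import NamedTuple


class ModelSpec(NamedTuple):
    """Fields encoded in a pre-trained model name (hypothetical helper type)."""
    dataset: str
    depth: int
    hidden_dim: int


# <dataset prefix>_<depth>x<hidden state dimension>, e.g. cc12m_32x1024
_NAME_PATTERN = re.compile(r"^(?P<dataset>[A-Za-z0-9]+)_(?P<depth>\d+)x(?P<hidden>\d+)$")


def parse_model_name(name: str) -> ModelSpec:
    """Split a model name such as 'cc12m_32x1024' into its parts."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    return ModelSpec(match["dataset"], int(match["depth"]), int(match["hidden"]))


spec = parse_model_name("cc12m_32x1024")
print(spec.dataset, spec.depth, spec.hidden_dim)
```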

After downloading a model or finishing training your own model, you can test it with new prompts, e.g.,

python -u main.py test pretrained_models/cc12m_32x1024/model.th "an armchair in the shape of an avocado"

You can also try it in the Colab Notebook. With the notebook, you can generate images from pre-trained models and interpolate between text prompts to create videos; see, for instance, video 1, video 2, or video 3.
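As a rough idea of how interpolation between two prompts can produce video frames, the sketch below linearly blends two feature vectors. It assumes the interpolation happens on the text feature vectors and that each blended vector is then passed through the model to render one frame; the notebook's actual method may differ, and `lerp` and `interpolation_path` are hypothetical helper names.

```python
def lerp(start, end, t):
    """Linear blend of two equal-length vectors: t=0 gives start, t=1 gives end."""
    return [(1.0 - t) * a + t * b for a, b in zip(start, end)]


def interpolation_path(start, end, num_frames):
    """Evenly spaced blends from start to end, one vector per video frame."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    return [lerp(start, end, i / (num_frames - 1)) for i in range(num_frames)]


# Toy 2-dimensional "features" standing in for the text features of two prompts.
frames = interpolation_path([0.0, 2.0], [1.0, 0.0], num_frames=3)
for frame in frames:
    print(frame)
```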