ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

Install

pip install -r requirements.txt
pip install -e .

Download Pretrained Weights

We provide five sets of pretrained weights:

  1. ViLT-B/32 pretrained with MLM+ITM for 200k steps on GCC+SBU+COCO+VG (ViLT-B/32 200k) link
  2. ViLT-B/32 200k finetuned on VQAv2 link
  3. ViLT-B/32 200k finetuned on NLVR2 link
  4. ViLT-B/32 200k finetuned on COCO IR/TR link
  5. ViLT-B/32 200k finetuned on F30K IR/TR link
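
Before launching a demo, it can help to confirm that the downloaded files are where the commands expect them. The following is a minimal sketch and not part of the repository: `missing_checkpoints` is a hypothetical helper, the `weights` directory is the one used in the example commands in this README, and only the two filenames that appear in those commands are checked.

```python
from pathlib import Path

# Filenames taken from the demo commands in this README; the other three
# checkpoints are not listed because their filenames are not stated here.
EXPECTED_CHECKPOINTS = ("vilt_200k_mlm_itm.ckpt", "vilt_vqa.ckpt")


def missing_checkpoints(weight_root, expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint filenames not found under weight_root."""
    root = Path(weight_root)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # "weights" is an assumed location; replace it with <YOUR_WEIGHT_ROOT>.
    missing = missing_checkpoints("weights")
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All expected checkpoints found.")
```

An empty result means both files are present; any name returned still needs to be downloaded or moved into the weight root.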

Out-of-the-box MLM + Visualization Demo

pip install gradio==1.6.4
python demo.py with num_gpus=<0 if you have no gpus else 1> load_path="<YOUR_WEIGHT_ROOT>/vilt_200k_mlm_itm.ckpt"

Example:
python demo.py with num_gpus=0 load_path="weights/vilt_200k_mlm_itm.ckpt"

Out-of-the-box VQA Demo

pip install gradio==1.6.4
python demo_vqa.py with num_gpus=<0 if you have no gpus else 1> load_path="<YOUR_WEIGHT_ROOT>/vilt_vqa.ckpt" test_only=True

Example:
python demo_vqa.py with num_gpus=0 load_path="weights/vilt_vqa.ckpt" test_only=True
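
Both demo commands share the same shape: a script name, the word `with`, and a list of `key=value` overrides. If you want to launch them from Python, the sketch below assembles that argument list. `build_demo_command` is a hypothetical helper written for illustration and does not exist in the repository; the script names, keys, and paths are the ones shown in the commands above.

```python
import shlex
import sys


def build_demo_command(script, load_path, num_gpus=0, **overrides):
    """Build the argument list for `python <script> with key=value ...`."""
    args = [sys.executable, script, "with",
            f"num_gpus={num_gpus}", f"load_path={load_path}"]
    # Any extra keyword arguments become additional key=value overrides.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


if __name__ == "__main__":
    cmd = build_demo_command(
        "demo_vqa.py", "weights/vilt_vqa.ckpt", num_gpus=0, test_only=True
    )
    # Print the command instead of running it; pass `cmd` to
    # subprocess.run(cmd, check=True) to actually launch the demo.
    print(shlex.join(cmd[1:]))
```

Set `num_gpus=1` when a GPU is available, and swap in `demo.py` with `weights/vilt_200k_mlm_itm.ckpt` (without `test_only`) for the MLM demo.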

Citation

If you use any part of this code or the pretrained weights, please cite our paper:

@article{kim2021vilt,
  title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
  author={Kim, Wonjae and Son, Bokyung and Kim, Ildoo},
  journal={arXiv preprint arXiv:2102.03334},
  year={2021}
}

GitHub

https://github.com/dandelin/ViLT