METER

Code and pre-trained models will be publicized soon.

Citation

@article{dou2021meter,
  title={An Empirical Study of Training End-to-End Vision-and-Language Transformers},
  author={Dou, Zi-Yi and Xu, Yichong and Gan, Zhe and Wang, Jianfeng and Wang, Shuohang and Wang, Lijuan and Zhu, Chenguang and Peng, Nanyun and Liu, Zicheng and Zeng, Michael},
  journal={arXiv},
  year={2021},
  url={https://arxiv.org/abs/2111.02387},
}

Acknowledgements

The code is based on ViLT and some of the code is borrowed from CLIP and Swin-Transformer.

GitHub

View Github