METER
Code and pre-trained models will be released soon.
Citation
@article{dou2021meter,
  title={An Empirical Study of Training End-to-End Vision-and-Language Transformers},
  author={Dou, Zi-Yi and Xu, Yichong and Gan, Zhe and Wang, Jianfeng and Wang, Shuohang and Wang, Lijuan and Zhu, Chenguang and Peng, Nanyun and Liu, Zicheng and Zeng, Michael},
  journal={arXiv preprint arXiv:2111.02387},
  year={2021},
  url={https://arxiv.org/abs/2111.02387},
}
Acknowledgements
The code is based on ViLT, and parts of it are borrowed from CLIP and Swin Transformer.