CLIP4CMR

“A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval”

The original data and pre-computed CLIP features are available here. The train.pkl and test.pkl files (in raw_data.rar) contain image pixel features and text id features, while clip_train.pkl and clip_test.pkl contain the 1024-dimensional CLIP image and text features.
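A minimal sketch for inspecting the pre-computed feature files, assuming they are standard pickle dumps (the file path and the printed key layout are assumptions; check the actual contents of the .pkl files):

```python
import pickle

# Load one of the pre-computed CLIP feature files (path is an assumption).
with open("clip_train.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))

# If the pickle is a dict, list its keys and the shape/type of each entry
# to see how the image features, text features and labels are stored.
if isinstance(data, dict):
    for key, value in data.items():
        shape = getattr(value, "shape", None)
        print(key, shape if shape is not None else type(value))
```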

@article{zeng2022comprehensive,
  title={A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval},
  author={Zeng, Zhixiong and Mao, Wenji},
  journal={arXiv preprint arXiv:2201.02772},
  year={2022}
}
