This repository is the official Pytorch implementation for CVPR2022 paper Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding.

Update on 2022/03/15: Release the training code.


  • Release the training code.
  • Release the code for generating pseudo-samples.


If you find our project useful in your research, please consider citing:

  title={Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding},
  author={Jiang, Haojun and Lin, Yuanze and Han, Dongchen and Song, Shiji and Huang, Gao},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},


  1. Introduction
  2. Usage
  3. Results
  4. Contacts
  5. Acknowledgments


We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images, and then language queries for these objects are obtained in an unsupervised fashion with a pseudo-query generation module. Extensive experimental results demonstrate that our method has two notable benefits: (1) it can reduce human annotation costs significantly, e.g., 31% on RefCOCO without degrading original model’s performance under the fully supervised setting, and (2) without bells and whistles, it achieves superior or comparable performance compared to state-of-the-art weakly-supervised visual grounding methods on all the five datasets we have experimented. For more details. please refer to our paper.



Data Preparation

1.You can download the images from the original source and place them in ./data/image_data folder:

2.The generated pseudo region-query pairs can be download from Tsinghua Cloud.

mkdir data
mv pseudo_samples.tar.gz ./data/
tar -zxvf pseudo_samples.tar.gz

Pretrained Checkpoints

1.You can download the DETR checkpoints from Tsinghua Cloud. These checkpoints should be downloaded and move to the checkpoints directory.

mkdir checkpoints
mv detr_checkpoints.tar.gz ./checkpoints/
tar -zxvf checkpoints.tar.gz

2.Checkpoints that trained on our pseudo-samples can be downloaded from Tsinghua Cloud. You can evaluate the checkpoints following the instruction right below.

mv pseudoq_checkpoints.tar.gz ./checkpoints/
tar -zxvf pseudoq_checkpoints.tar.gz

Training and Evaluation

  1. Training on RefCOCO.

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 28888 --use_env --num_workers 8 --epochs 10 --batch_size 32 --lr 0.00025 --lr_bert 0.000025 --lr_visu_cnn 0.000025 --lr_visu_tra 0.000025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --backbone resnet50 --detr_model checkpoints/detr-r50-unc.pth --bert_enc_num 12 --detr_enc_num 6 --dataset unc --max_query_len 20 --data_root ./data/image_data --split_root ./data/pseudo_samples/ --prompt "find the region that corresponds to the description {pseudo_query}" --output_dir ./outputs/unc/;

    Please refer to scripts/ for training commands on other datasets.

  2. Evaluation on RefCOCO.

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 28888 --use_env --num_workers 4 --batch_size 128 --backbone resnet50 --bert_enc_num 12 --detr_enc_num 6 --dataset unc --max_query_len 20 --data_root ./data/image_data --split_root ./data/pseudo_samples/ --eval_model ./checkpoints/unc_best_checkpoint.pth --eval_set testA --prompt "find the region that corresponds to the description {pseudo_query}" --output_dir ./outputs/unc/testA/;

    Please refer to scripts/ for evaluation commands on other splits or datasets.


Results on RefCOCO/RefCOCO+/RefCOCOg

Results on ReferItGame/Flickr30K Entities

Please refer to our paper for more details.


jhj20 at mails dot tsinghua dot edu dot cn

Any discussions or concerns are welcomed!


This codebase is based on TransVG.


View Github