DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Created by Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu.

This repository contains the PyTorch implementation of DenseCLIP.

DenseCLIP is a new framework for dense prediction that implicitly and explicitly leverages the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using contextual information from the image to prompt the language model, we help our model better exploit the pre-trained knowledge. Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models.
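To make the pixel-text matching idea concrete, here is a minimal, framework-free sketch of a pixel-text score map. It assumes CLIP-style cosine similarity between L2-normalized pixel features and class text embeddings; the function names and the toy inputs are illustrative and are not taken from this repository. In PyTorch the same computation would typically be a single batched torch.einsum call.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def pixel_text_score_map(pixel_feats, text_feats):
    """Return scores[k][i][j]: cosine similarity of pixel (i, j) and class k.

    pixel_feats: H x W grid of C-dimensional feature vectors.
    text_feats:  K class embeddings, each C-dimensional.
    """
    texts = [l2_normalize(t) for t in text_feats]
    pixels = [[l2_normalize(p) for p in row] for row in pixel_feats]
    return [
        [[sum(a * b for a, b in zip(p, t)) for p in row] for row in pixels]
        for t in texts
    ]

# Toy example (hypothetical values): a 1 x 2 feature map, C = 2, K = 2 classes.
feats = [[[1.0, 0.0], [0.0, 2.0]]]
classes = [[3.0, 0.0], [0.0, 1.0]]
scores = pixel_text_score_map(feats, classes)
```

Each of the K output maps has the same spatial size as the feature map, which is what lets the scores act as per-pixel guidance for a dense prediction head.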


Our code is based on mmsegmentation, mmdetection, and timm.

[Project Page] [arXiv]



Requirements

  • torch>=1.8.0
  • torchvision
  • timm
  • mmcv-full==1.3.17
  • mmseg==0.19.0
  • mmdet==2.17.0
  • fvcore

To use our code, first install mmcv-full and mmseg/mmdet following their official guidelines (mmseg, mmdet), then prepare the datasets accordingly.
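Because several of the packages above are pinned to exact versions, a quick environment check can save a failed run. The following is a minimal sketch using only the standard library; it assumes that mmseg is distributed under the name "mmsegmentation", and the helper names are hypothetical rather than part of this repository.

```python
from importlib import metadata

# Constraints copied from the requirements list; None means "any version".
# Assumption: the mmseg package is installed under the name "mmsegmentation".
REQUIRED = {
    "torch": (">=", "1.8.0"),
    "torchvision": None,
    "timm": None,
    "mmcv-full": ("==", "1.3.17"),
    "mmsegmentation": ("==", "0.19.0"),
    "mmdet": ("==", "2.17.0"),
    "fvcore": None,
}

def release_tuple(version):
    """Parse the leading numeric release, e.g. '1.8.0+cu111' -> (1, 8, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def check_environment(lookup=installed_version):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    for name, constraint in REQUIRED.items():
        found = lookup(name)
        if found is None:
            problems.append(f"{name}: not installed")
            continue
        if constraint is None:
            continue
        op, wanted = constraint
        have, need = release_tuple(found), release_tuple(wanted)
        ok = have == need if op == "==" else have >= need
        if not ok:
            problems.append(f"{name}: found {found}, expected {op}{wanted}")
    return problems
```

Calling check_environment() with no arguments inspects the active Python environment; the lookup parameter exists only so the logic can be exercised without installing anything.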

Pre-trained CLIP Models

Download the pre-trained CLIP models and save them to the pretrained folder.


Segmentation

Model Zoo

We provide DenseCLIP models for the Semantic FPN framework.

Model            FLOPs (G)  Params (M)  mIoU (SS)  mIoU (MS)  config  url
RN50-CLIP        248.8      31.0        36.9       43.5       config  -
RN50-DenseCLIP   269.2      50.3        43.5       44.7       config  Tsinghua Cloud
RN101-CLIP       326.6      50.0        42.7       44.3       config  -
RN101-DenseCLIP  346.3      67.8        45.1       46.5       config  Tsinghua Cloud
ViT-B-CLIP       1037.4     100.8       49.4       50.3       config  -
ViT-B-DenseCLIP  1043.1     105.3       50.6       51.3       config  Tsinghua Cloud

Training & Evaluation on ADE20K

To train the DenseCLIP model based on CLIP ResNet-50, run:

bash configs/ 8

To evaluate the performance with multi-scale testing, run:

bash configs/ /path/to/checkpoint 8 --eval mIoU --aug-test
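For readers unfamiliar with the metric requested by --eval mIoU, the following is a simplified sketch of mean intersection-over-union over flat label sequences. It is not the mmseg implementation; ignore_index=255 and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes for flat sequences of predicted and true labels."""
    intersection = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:  # unlabeled pixels do not count (assumed value)
            continue
        pred_area[p] += 1
        target_area[t] += 1
        if p == t:
            intersection[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersection[c]
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With multi-scale testing, predictions from several input scales are combined before this kind of per-class comparison is made, which is why the MS column is reported separately from the SS column.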

To better measure the complexity of the models, we provide a tool based on fvcore to accurately compute the FLOPs of torch.einsum and other operations:

python /path/to/config --fvcore

You can also remove the --fvcore flag to obtain the FLOPs measured by mmcv for comparison.
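To illustrate why einsum matters for the count, here is a small sketch that estimates the multiply-accumulate operations of a two-operand einsum from its equation and operand shapes. It is an illustrative helper, not the tool shipped with this repository, and it assumes a naive contraction in which every distinct index is looped over once. The equation and shapes in the example are hypothetical.

```python
from math import prod

def einsum_macs(equation, *shapes):
    """Count multiply-accumulates for a naive two-operand einsum.

    Every distinct index letter contributes its size once, so the count is
    the product of the sizes of all indices that appear in the inputs.
    """
    inputs = equation.replace(" ", "").split("->")[0].split(",")
    if len(inputs) != len(shapes):
        raise ValueError("one shape is required per operand")
    sizes = {}
    for subscripts, shape in zip(inputs, shapes):
        if len(subscripts) != len(shape):
            raise ValueError(f"rank mismatch for operand '{subscripts}'")
        for letter, size in zip(subscripts, shape):
            if sizes.setdefault(letter, size) != size:
                raise ValueError(f"inconsistent size for index '{letter}'")
    return prod(sizes.values())

# Hypothetical shapes: pixel features (b, c, h, w) against text features (b, k, c).
macs = einsum_macs("bchw,bkc->bkhw", (2, 4, 3, 3), (2, 5, 4))
```

A counter that does not register a handler for einsum would report zero cost for this operation, so the two measurement modes can differ on models that compute score maps this way.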


Detection

Model Zoo

We provide models for both the RetinaNet and Mask R-CNN frameworks.

RetinaNet

Model            FLOPs (G)  Params (M)  box AP  config  url
RN50-CLIP        265        38          36.9    config  -
RN50-DenseCLIP   285        60          37.8    config  Tsinghua Cloud
RN101-CLIP       341        57          40.5    config  -
RN101-DenseCLIP  360        78          41.1    config  Tsinghua Cloud

Mask R-CNN

Model            FLOPs (G)  Params (M)  box AP  mask AP  config  url
RN50-CLIP        301        44          39.3    36.8     config  -
RN50-DenseCLIP   327        67          40.2    37.6     config  Tsinghua Cloud
RN101-CLIP       377        63          42.2    38.9     config  -
RN101-DenseCLIP  399        84          42.6    39.6     config  Tsinghua Cloud

Training & Evaluation on COCO

To train DenseCLIP-RN50 with the RetinaNet framework, run:

bash configs/ 8

To evaluate the box AP of RN50-DenseCLIP (RetinaNet), run:

bash configs/ /path/to/checkpoint 8 --eval bbox

To evaluate both the box AP and the mask AP of RN50-DenseCLIP (Mask R-CNN), run:

bash configs/ /path/to/checkpoint 8 --eval bbox segm


MIT License


If you find our work useful in your research, please consider citing:

  title={DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting},
  author={Rao, Yongming and Zhao, Wenliang and Chen, Guangyi and Tang, Yansong and Zhu, Zheng and Huang, Guan and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2112.01518},

