[AAAI 2022] Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

Official PyTorch implementation of Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding (AAAI 2022).

Paper is at https://arxiv.org/pdf/2109.04872.pdf.

Paper explanation in Zhihu (in Chinese) is at https://zhuanlan.zhihu.com/p/446203594.


Temporal grounding aims to localize a video moment which is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation, with the research focus on designing complicated prediction heads or fusion strategies. Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that our MMN achieves highly competitive performance compared with the state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winning solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding by capturing the essential cross-modal correlation in a joint embedding space.
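To make the metric-learning view concrete, here is a minimal sketch of a symmetric, InfoNCE-style cross-modal contrastive objective in plain Python. It is not the MMN loss from the paper or from this repository; the function names, the temperature value, and the toy embeddings are all assumptions chosen for illustration. Query i is assumed to match moment i, and every other entry in the same row or column of the similarity matrix acts as a negative, which is where moments and queries from other videos would enter.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(scores, target):
    """Negative log-softmax of scores[target], computed stably."""
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[target]

def mutual_matching_loss(query_embs, moment_embs, temperature=0.1):
    """Symmetric contrastive loss; query i matches moment i (assumed pairing)."""
    n = len(query_embs)
    sim = [[cosine(q, m) / temperature for m in moment_embs] for q in query_embs]
    # Query -> moment: the other moments in the batch are negatives.
    q2m = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Moment -> query: the other queries in the batch are negatives.
    m2q = sum(cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (q2m + m2q)

# Toy embeddings (hypothetical): aligned pairs point in similar directions.
queries = [[1.0, 0.0], [0.0, 1.0]]
moments = [[0.9, 0.1], [0.1, 0.9]]
aligned = mutual_matching_loss(queries, moments)
shuffled = mutual_matching_loss(queries, moments[::-1])
print(f"aligned pairs: {aligned:.4f}, shuffled pairs: {shuffled:.4f}")
```

The loss is small when each query is closest to its own moment and large when the pairing is shuffled, which is the behavior a joint embedding space is trained toward.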


Dec, 2021 – We uploaded the code and trained weights for Charades-STA, ActivityNet-Captions and TACoS datasets.

Todo: The code for spatio-temporal video grounding (HC-STVG dataset) will be available soon.


  • Download the video features and ground-truth annotations provided by 2D-TAN.
  • Extract them into a dataset folder in the same directory as train_net.py. To configure the feature and ground-truth paths, see ./mmn/config/paths_catalog.py (ann_file is the annotation file, feat_file is the video feature file).
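Before launching a run, it can help to confirm that the annotation and feature files are where the configuration expects them. The sketch below is a standalone helper, not part of this repository: the function name, the dictionary layout, and the example paths are hypothetical, and only the ann_file and feat_file key names come from the description above.

```python
import os

def missing_dataset_files(entries, root="."):
    """Return (dataset, key, path) triples for files that do not exist."""
    missing = []
    for name, paths in entries.items():
        for key in ("ann_file", "feat_file"):
            path = os.path.join(root, paths[key])
            if not os.path.isfile(path):
                missing.append((name, key, path))
    return missing

# Hypothetical entries; the real ones live in ./mmn/config/paths_catalog.py.
example_entries = {
    "example_train": {
        "ann_file": "dataset/example/train.json",
        "feat_file": "dataset/example/features.hdf5",
    },
}

for name, key, path in missing_dataset_files(example_entries):
    print(f"[{name}] {key} not found: {path}")
```

Replace the example entries with the paths from your own configuration and run the check from the directory that contains train_net.py.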


Our code builds on the third-party implementation of 2D-TAN, so its dependencies are similar, including:

  • yacs
  • h5py
  • terminaltables
  • tqdm
  • pytorch
  • transformers
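A quick way to see which of these packages are missing from the current environment is to probe for them with the standard library. This is a minimal sketch rather than a script shipped with the repository; the helper name is an assumption, and note that PyTorch is imported under the module name torch.

```python
import importlib.util

# Import names for the packages listed above; PyTorch is imported as "torch".
REQUIRED_MODULES = ["yacs", "h5py", "terminaltables", "tqdm", "torch", "transformers"]

def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```

The check only confirms that each module can be found; it does not verify versions.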

Quick Start

We provide scripts for simplifying training and inference. For training our model, we provide a script for each dataset (e.g., ./scripts/tacos_train.sh). For evaluating the performance, we provide ./scripts/eval.sh.

For example, to train the model on the TACoS dataset with tacos_train.sh, select the config in config, then set the GPU ids for your server in gpus and the total number of GPUs in gpun.

# find all configs in configs/
# set your gpu id
# number of gpus
# change this to a different value (e.g., 29502) when you run multiple MMN tasks on the same machine
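When several jobs share one machine, choosing the port by hand is easy to get wrong. The sketch below shows one way to ask the operating system for an unused TCP port and to derive the GPU count from a comma-separated id list. Both helpers are hypothetical conveniences, not part of the provided scripts, and the example gpus value is an assumption.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]

def count_gpus(gpus):
    """Count the ids in a comma-separated GPU list such as "0,1,2,3"."""
    return len([g for g in gpus.split(",") if g.strip()])

gpus = "0,1"  # example value for the gpus setting
print("gpun =", count_gpus(gpus))
print("free port =", find_free_port())
```

The port is released as soon as the helper returns, so another process could claim it before training starts; treat the value as a good candidate rather than a reservation.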

Similarly, to evaluate the model, edit the corresponding settings in eval.sh. Our trained weights for the three datasets are available on Google Drive.


If you find our code useful, please cite our paper. (The BibTeX entry for the AAAI version will be added later.)

  author    = {Zhenzhi Wang and
               Limin Wang and
               Tao Wu and
               Tianhao Li and
               Gangshan Wu},
  title     = {Negative Sample Matters: {A} Renaissance of Metric Learning for Temporal Grounding},
  journal   = {CoRR},
  volume    = {abs/2109.04872},
  year      = {2021}


For any questions, please raise an issue (preferred) or contact

Zhenzhi Wang: [email protected]


We thank 2D-TAN for the video features and configurations, and the third-party implementation of 2D-TAN for its DistributedDataParallel implementation. Disclaimer: the performance gain of this third-party implementation is due to a small mistake of adding the val set to training, yet our reproduced result is similar to the result reported in the 2D-TAN paper.

