We present new splits of Charades-STA and ActivityNet Captions, called Charades-CD (Charades-STA under Changing Distributions) and ActivityNet-CD, respectively. These splits are created by re-organizing all splits (i.e., the training, validation, and test sets) of the original datasets, and the ground-truth moment distributions are designed to differ between the training and test splits, i.e., out-of-distribution (OOD) testing. To better demonstrate the generalization ability of temporal sentence grounding models, and to compare performance on OOD samples against independent and identically distributed (IID) samples, we also maintain a test split with IID samples, denoted test-iid (vs. test-ood).
To demonstrate the difficulty of the newly proposed splits (i.e., Charades-CD and ActivityNet-CD), we compare the performance of two simple baselines and eight state-of-the-art (SOTA) methods on both the original and proposed splits. The two simple baselines are:
- Bias-based method: it uses Gaussian kernel density estimation to fit the moment annotation distribution and randomly samples several locations from the fitted distribution as its moment predictions.
- PredictAll method: it directly predicts the whole video as the moment prediction.
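The two baselines can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes moments are stored as normalized `(start, end)` pairs in `[0, 1]`, uses a diagonal bandwidth from Scott's rule, and clips or rejects invalid samples. The function names `fit_bandwidths`, `sample_moments`, and `predict_all` are hypothetical.

```python
import random
import statistics


def fit_bandwidths(moments):
    """Per-dimension Gaussian bandwidths via Scott's rule (assumed choice, d = 2)."""
    n = len(moments)
    factor = n ** (-1.0 / (2 + 4))  # Scott's factor n^(-1/(d+4))
    starts = [s for s, _ in moments]
    ends = [e for _, e in moments]
    return factor * statistics.stdev(starts), factor * statistics.stdev(ends)


def sample_moments(moments, k, rng):
    """Draw k (start, end) pairs from a Gaussian KDE fitted on `moments`.

    Sampling from a Gaussian KDE amounts to picking one training
    annotation uniformly at random and adding Gaussian kernel noise.
    """
    h_start, h_end = fit_bandwidths(moments)
    samples = []
    while len(samples) < k:
        s0, e0 = rng.choice(moments)          # pick a kernel centre
        s = max(0.0, rng.gauss(s0, h_start))  # perturb, then clip to the video
        e = min(1.0, rng.gauss(e0, h_end))
        if s < e:                             # reject degenerate moments
            samples.append((s, e))
    return samples


def predict_all(duration):
    """PredictAll baseline: the whole video is the predicted moment."""
    return (0.0, duration)


# Toy training annotations (made up for illustration only).
train_moments = [(0.0, 0.3), (0.05, 0.35), (0.1, 0.4), (0.6, 0.9)]
bias_predictions = sample_moments(train_moments, 5, random.Random(0))
whole_video = predict_all(30.0)
```

Neither baseline looks at the video or the query, so any score they achieve reflects the moment annotation distribution alone.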
For all the SOTA methods, we use the official public implementations to obtain their temporal grounding results. The results on the proposed test-iid and test-ood sets of each dataset come from the same model, fine-tuned on the respective val set. For a fairer comparison, we unify the feature representations of the videos and sentence queries. To accommodate most temporal sentence grounding in videos (TSGV) methods, we use I3D features for the videos in Charades-STA (Charades-CD) and C3D features for the videos in ActivityNet Captions (ActivityNet-CD). Each word in the query sentences is encoded with a GloVe word embedding.
We report the performance of all the TSGV methods mentioned above, measured with an R@n,IoU@m metric (recall among the top-n predictions at IoU threshold m), in the figure below. From this figure, we observe that almost all methods show a significant performance gap between the test-iid and test-ood sets, i.e., these methods over-rely on the moment annotation biases and fail to generalize to the OOD test set. Meanwhile, the results on the original test set and the proposed test-iid set are relatively close, because the moment distribution of the test-iid set is still similar to that of the majority of the whole dataset.
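For readers unfamiliar with this family of metrics, the sketch below shows one common way to compute temporal IoU and recall at an IoU threshold. It is an illustration under stated assumptions, not the evaluation code used for the figure: moments are `(start, end)` pairs in seconds, a prediction counts as a hit when its IoU is at least `m` (some implementations use a strict inequality), and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n_iou(predictions, ground_truths, n, m):
    """Percentage of queries whose top-n predictions contain an IoU >= m hit.

    `predictions[i]` is a ranked list of (start, end) moments for query i,
    and `ground_truths[i]` is the annotated moment for that query.
    """
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= m for p in ranked[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


# Toy example (made-up numbers): one close prediction, one whole-video guess.
preds = [[(2.0, 8.0)], [(0.0, 30.0)]]
gts = [(3.0, 9.0), (10.0, 15.0)]
score = recall_at_n_iou(preds, gts, n=1, m=0.5)
```

In the toy example, the whole-video guess for the second query overlaps the ground truth only slightly, so it does not count as a hit at this threshold.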