Better Safe Than Sorry: Preventing Delusive Adversaries with Adversarial Training

Code for NeurIPS 2021 paper “Better Safe Than Sorry: Preventing Delusive Adversaries with Adversarial Training” by Lue Tao, Lei Feng, Jinfeng Yi, Sheng-Jun Huang, and Songcan Chen.
This repository contains an implementation of the attacks (P1~P5) and the defense (adversarial training) in the paper.


Our code relies on PyTorch, which will be automatically installed when you follow the instructions below.

conda create -n delusion python=3.8
conda activate delusion
pip install -r requirements.txt

Running Experiments

  1. Pre-train a standard model on CIFAR-10 (the dataset will be automatically download).

python --train_loss ST
  1. Generate perturbed training data.

python --poison_type P1
python --poison_type P2
python --poison_type P3
python --poison_type P4
python --poison_type P5
  1. Visualize the perturbed training data (optional).

tensorboard --logdir ./results
  1. Standard training on the perturbed data.

python --train_loss ST --poison_type P1
python --train_loss ST --poison_type P2
python --train_loss ST --poison_type P3
python --train_loss ST --poison_type P4
python --train_loss ST --poison_type P5
  1. Adversarial training on the perturbed data.

python --train_loss AT --poison_type P1
python --train_loss AT --poison_type P2
python --train_loss AT --poison_type P3
python --train_loss AT --poison_type P4
python --train_loss AT --poison_type P5


Figure 1: An illustration of delusive attacks and adversarial training. Left: Random samples from the CIFAR-10 training set: the original training set D and the perturbed training set DP5 generated using the P5 attack. Right: Natural accuracy evaluated on the CIFAR-10 test set for models trained with: i) standard training on D; ii) adversarial training on D; iii) standard training on DP5; iv) adversarial training on DP5. While standard training on DP5 incurs poor generalization performance on D, adversarial training can help a lot.


Table 1: Below we report mean and standard deviation of the test accuracy for the CIFAR-10 dataset. As we can see, the performance deviations of the defense (i.e., adversarial training) are very small (< 0.50%), which hardly effect the results. In contrast, the results of standard training are relatively unstable.

Training method \ Training data P1 P2 P3 P4 P5
Standard training 37.87±0.94 74.24±1.32 15.14±2.10 23.69±2.98 11.76±0.72
Adversarial training 86.59±0.30 89.50±0.21 88.12±0.39 88.15±0.15 88.12±0.43


Key takeaways: Our theoretical justifications in the paper, along with the empirical results, suggest that adversarial training is a principled and promising defense against delusive attacks.

Citing this work

    title={Better Safe Than Sorry: Preventing Delusive Adversaries with Adversarial Training},
    author={Tao, Lue and Feng, Lei and Yi, Jinfeng and Huang, Sheng-Jun and Chen, Songcan},
    booktitle={Advances in Neural Information Processing Systems (NeurIPS)},


View Github