Backdoors 101

Backdoors 101 — is a PyTorch framework for state-of-the-art backdoor defenses and attacks on deep learning models. It includes real-world datasets, centralized and federated learning, and supports various attack vectors. The code is mostly based on "Blind Backdoors in Deep Learning Models" and "How To Backdoor Federated Learning" papers, but we always look for incorporating newer results.

If you have a new defense or attack, let us know (raise an issue or send an email), happy to help porting it. If you are doing research on backdoors and want some assistance don't hesitate to ask questions.

Current status

We try to incorporate new attacks and defenses as well as to extend
the supported datasets and tasks. Here is the high-level overview of the
possible attack vectors:



  • Pixel-pattern (incl. single-pixel) - traditional pixel modification attacks.
  • Physical - attacks that are triggered by physical objects.
  • Semantic backdoors - attacks that don't modify the input (e.g. react on
    features already present in the scene).

TODO clean-label (good place to contribute).

Injection methods

  • Data poisoning - adds backdoors into the dataset.
  • Batch poisoning - injects backdoor samples directly into the batch during
  • Loss poisoning - modifies the loss value during training (supports dynamic
    loss balancing, see Sec 3.4 )

TODO: model poisoning (good place to contribute!).


  • Image Classification - ImageNet, CIFAR-10, Pipa face identification,
    MultiMNIST, MNIST.
  • Text - IMDB reviews datasets, Reddit (coming)

TODO: Face recognition, eg Celeba or VGG. We already have some code, but need
expertise on producing good models (good place to contribute!).


  • Input perturbation -
    NeuralCleanse +
    added evasion.
  • Model anomalies - SentiNet + added
  • Spectral clustering / fine-pruning + added evasion.

TODO: Port Jupyter notebooks demonstrating defenses and evasions. Add new
defenses and evasions (good place to contribute!).

Training regimes

  • Centralized training.
  • Differentially private / gradient shaping training.
  • Federated Learning (CIFAR-10 only).


First, we want to give some background on backdoor attacks, note that our
definition is inclusive of many other definitions stated before and supports
all the new attacks (e.g. clean-label, feature-mix, semantic).

  1. Deep Learning. We focus on supervised learning setting where our goal is to
    learn some task m: X -> Y (we call it a main task) on some
    domain of
    inputs X
    and labels Y.
    A model θ for task m is trained on tuples (x,y) ∈ (X,Y) using
    some loss criterion L (e.g. cross-entropy): L(θ(x), y).

  2. Backdoor definition. A backdoor introduces malicious behavior
    m* additional
    to the main behavior m the model is trained for. Therefore, we state
    that a backdoor attack is
    essentially a multi-task setting with two or more tasks: main task m
    and backdoor task ***m****, and if needed evasion tasks ***mev
    ***. The model trained for two tasks will exhibit both normal and
    backdoor behavior.

  3. Backdoor data. In order to introduce a backdoor task
    m*: X* -> Y*
    the model has to be trained on a different domain of backdoor inputs and
    labels: (X*, Y*). Intuitively we can
    differentiate that the backdoor domain X* contains
    inputs that contain backdoor features. The main domain X might also
    include backdoor inputs, i.e. when backdoors are naturally occurring features.
    However, note that the
    input domain X* should not prevail in the main task domain
    X, e.g. X \ X* ≈ 0, otherwise two tasks will collude.

  4. Backdoor feature. Initially, a backdoor trigger was defined as a pixel
    pattern, therefore clearly separating the backdoor domain X*
    from the main domain X. However, recent works on semantic backdoors,
    edge-case backdoors and physical backdoors allow the backdoor feature to be
    a part of the unmodified input (ie. a particular model of a car or an
    airplane that will be misclassified as birds).

    We propose to use synthesizers that transform non
    -backdoored inputs
    to contain backdoor features and create backdoor labels. For example in
    image backdoors. The input synthesizer can simply insert a pixel pattern
    on top of an image,
    perform more complex transformations, or substitute the image with a
    backdoored image (edge-case backdoors).

  5. Complex backdoors. A domain of backdoor labels Y* can
    contain many labels. This setting is different from all other
    backdoor attacks, where the presence of a backdoor feature would always result
    in a specific label. However, our setting allows a new richer set of attacks
    for example a model trained on a task to count people
    in the image might contain a backdoor task to identify particular


  1. Supporting multiple backdoors. Our definition enables multiple
    backdoor tasks. As a toy example we can attack a model that recognizes a two
    number and inject two new backdoor tasks: one that sums up digits and another
    one that multiplies them.


  1. Methods to inject backdoor task. Depending on a selected threat
    model the attack can inject backdoors by
    poisoning the training dataset, directly mixing backdoor inputs into a
    training batch, altering loss functions, or modifying model weights. Our
    framework supports all these methods, but primarily focuses on injecting
    backdoors by adding a special loss value. We also utilize Multiple
    Gradient Descent Algorithm (MGDA) to
    efficiently balance multiple losses.


Now, let's configure the system:

  • Install all dependencies: pip install -r requirements.txt.
  • Create two directories: runs for Tensorboard graphs and saved_models to
    store results.
  • Startup Tensorboard: tensorboard --logdir=runs/.

Next, let's run some basic attack on MNIST dataset. We use YAML files to
configure the attacks. For MNIST attack, please refer to the configs /mnist_params.yaml file. For the full set of
parameters see the
dataclass Parameters. Let's start the training:

python --name mnist --params configs/mnist_params.yaml --commit none

Argument name specifies Tensorboard name and commit just records the
commit id into a log file for reproducibility.


Our framework includes a training file that
heavily relies on a Helper object storing all the necessary
objects for training. The helper object contains the main
Task that stores models, datasets, optimizers, and
other parameters for the training. Another object Attack contains synthesizers and performs
loss computation for multiple tasks.