This repo contains a PyTorch implementation of learning rate dropout from the paper "Learning Rate Dropout" by Lin et al.
To train a ResNet34 model on CIFAR-10 with the paper's hyperparameters, run:
python main.py --lr=.1 --lr_dropout_rate=0.5
The original code is from the pytorch-cifar repo. It uses track-ml for logging metrics. This implementation doesn't add standard dropout.
The vanilla method is from pytorch-cifar: SGD with `lr=.1, momentum=.9, weight_decay=5e-4, batch_size=128`. The SGD-LRD method uses `lr_dropout_rate=0.5`. I ran four trials for each method.
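For readers unfamiliar with the method, here is a minimal sketch of what a single SGD-LRD update might look like on one tensor with the hyperparameters above. The function name `sgd_lrd_step` and its signature are hypothetical and written for illustration only. This is not the optimizer code from this repo, and it assumes the momentum buffer accumulates the unmasked gradient while the mask is applied only to the final parameter update.

```python
import torch


@torch.no_grad()
def sgd_lrd_step(param, grad, buf, lr=0.1, momentum=0.9,
                 weight_decay=5e-4, lr_dropout_rate=0.5):
    """One illustrative SGD-LRD update on a single tensor (hypothetical helper)."""
    # Fold L2 weight decay into the gradient, as torch.optim.SGD does.
    d_p = grad + weight_decay * param
    # Assumption: the momentum buffer sees the full, unmasked gradient.
    buf.mul_(momentum).add_(d_p)
    # Each coordinate keeps its learning rate with probability lr_dropout_rate.
    mask = (torch.rand_like(param) < lr_dropout_rate).type(param.dtype)
    # Coordinates with mask == 0 are left unchanged this step.
    param.sub_(lr * mask * buf)
    return mask


torch.manual_seed(0)
w = torch.ones(1000)
g = torch.ones(1000)
buf = torch.zeros(1000)
mask = sgd_lrd_step(w, g, buf)
print("fraction of coordinates updated:", mask.mean().item())
```

With `lr_dropout_rate=0.5`, roughly half of the coordinates move on each step and the rest keep their previous values.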
It looks like LRD helps in the beginning of training, but does not provide major boosts after applying the LR schedule. Here are the final test accuracies:
Shortly after this repo was published, the authors created an official repo for their paper here. The only differences I could find between the implementations are:
- The official code uses `torch.bernoulli` for the mask while I use `(torch.rand_like(...) < lr_dropout_rate).type(d_p.dtype)`.
- I use in-place elementwise multiplication (`.mul_`) while they use
- They clone `buf` before adding it to the parameters.
- They multiply the LR and mask before adding it to the parameters, while I wait until the end and do
It's unclear why these small differences would lead to such a large gap in performance between the implementations.
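As a rough sanity check on two of the differences listed above, the sketch below compares the two mask constructions and the two multiplication orders on a random tensor. It assumes the official code draws its mask by calling `torch.bernoulli` on a tensor filled with the dropout rate, which is a guess at the call shape rather than a quote from their code. The variable names are illustrative.

```python
import torch

p = 0.5   # lr_dropout_rate
lr = 0.1
torch.manual_seed(0)
d_p = torch.randn(100_000)  # stand-in for an update tensor

# Two ways to draw a mask where each entry is 1 with probability p.
mask_bernoulli = torch.bernoulli(torch.full_like(d_p, p))
mask_rand = (torch.rand_like(d_p) < p).type(d_p.dtype)
print("bernoulli keep rate:", mask_bernoulli.mean().item())
print("rand_like keep rate:", mask_rand.mean().item())

# For a fixed mask, masking the update in place and then scaling by the LR
# gives the same result as scaling the mask by the LR first.
mask_then_lr = d_p.clone().mul_(mask_rand) * lr
lr_then_mask = (lr * mask_rand) * d_p
print("orders agree:", torch.allclose(mask_then_lr, lr_then_mask))
```

Under these assumptions both masks have the same distribution and both orders produce the same update, so this check does not explain the gap either.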