# CosineAnnealingWithWarmup

## Formulation

The learning rate follows a cosine annealing schedule over *n_total* total training steps, with an initial warmup period of *n_warmup* steps. Hence, the learning rate at step *i* is computed as:
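
One common form of such a schedule is sketched here; it is not necessarily the exact expression this scheduler uses. It assumes a linear warmup from $\eta_{\text{warmup}}$ to $\eta_{\text{base}}$, followed by a cosine decay to $\eta_{\text{final}}$, with the warmup steps counted as part of *n_total*. The symbols mirror the `warmup_lr`, `base_lr`, and `final_lr` arguments of `LR_Scheduler`; the linear warmup shape is an assumption.

$$
\eta_i =
\begin{cases}
\eta_{\text{warmup}} + \left(\eta_{\text{base}} - \eta_{\text{warmup}}\right)\dfrac{i}{n_{\text{warmup}}}, & 0 \le i < n_{\text{warmup}}, \\[2ex]
\eta_{\text{final}} + \dfrac{1}{2}\left(\eta_{\text{base}} - \eta_{\text{final}}\right)\left(1 + \cos\left(\pi\,\dfrac{i - n_{\text{warmup}}}{n_{\text{total}} - n_{\text{warmup}}}\right)\right), & n_{\text{warmup}} \le i \le n_{\text{total}}.
\end{cases}
$$

Under these assumptions the two branches meet at $\eta_{\text{base}}$ when $i = n_{\text{warmup}}$, and the schedule ends at $\eta_{\text{final}}$ when $i = n_{\text{total}}$.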

The learning rate changes over the course of training as follows:
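
To get a feel for the shape of the curve, the following minimal Python sketch evaluates a warmup-plus-cosine schedule at a few steps. The function name `cosine_warmup_lr`, the linear warmup, and the example values are assumptions made for illustration; this is not the implementation of `LR_Scheduler`.

```python
import math


def cosine_warmup_lr(step, n_warmup, n_total, warmup_lr, base_lr, final_lr):
    """Hypothetical helper: learning rate at `step` (0-indexed).

    Assumes a linear warmup from warmup_lr to base_lr over n_warmup steps,
    then a cosine decay from base_lr to final_lr over the remaining steps.
    """
    if step < n_warmup:
        # Linear interpolation during the warmup phase.
        return warmup_lr + (base_lr - warmup_lr) * step / n_warmup
    # Fraction of the decay phase completed, clamped to at most 1.
    decay_steps = max(1, n_total - n_warmup)
    progress = min(1.0, (step - n_warmup) / decay_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Print a coarse trace of the schedule for illustrative values.
for step in range(0, 101, 10):
    lr = cosine_warmup_lr(step, n_warmup=10, n_total=100,
                          warmup_lr=0.0, base_lr=0.1, final_lr=0.0)
    print(f"step {step:3d}: lr = {lr:.4f}")
```

With these values the rate rises linearly during the first 10 steps, peaks at `base_lr`, and then decays smoothly toward `final_lr` by step 100.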

## Usage

```
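# Arguments: optimizer, warmup_epochs, warmup_lr, num_epochs, base_lr, final_lr, iter_per_epoch
# Each learning rate is scaled linearly with the batch size (lr * batch_size / 256).
lr_scheduler = LR_Scheduler(
    optimizer,
    args.warmup_epochs, args.warmup_lr * args.batch_size / 256,
    args.epochs, args.lr * args.batch_size / 256, args.final_lr * args.batch_size / 256,
    len(train_loader),
)
for epoch in range(args.epochs):
    # Assumes train_loader yields (input, target) pairs.
    for data, gt in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = lossfunc(output, gt)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()  # step once per iteration, not once per epoch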
```

In the CV domain [1,2], you can adapt automatically to different batch sizes by using a learning rate of lr × BatchSize / 256 (linear scaling [4]). A larger batch size allows a larger learning rate, especially when you use the LARS optimizer [3]. Of course, you can modify this rule according to your specific requirements.
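
As a small illustration of the linear scaling rule, the sketch below wraps the lr × BatchSize / 256 expression in a helper. The name `scale_lr` and the `reference_batch_size` parameter are hypothetical and are not part of `LR_Scheduler`.

```python
def scale_lr(lr, batch_size, reference_batch_size=256):
    """Hypothetical helper: scale a learning rate linearly with the batch size."""
    return lr * batch_size / reference_batch_size


# Doubling the batch size relative to the reference doubles the learning rate.
for batch_size in (128, 256, 512):
    print(batch_size, scale_lr(0.05, batch_size))
```

Exposing the reference batch size as a parameter makes it easy to change the 256 baseline if your setup calls for a different one.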

## Reference

