Apollo Optimizer in Tensorflow 2.x
- Warmup is important with the Apollo optimizer, so be sure to pass a learning rate schedule rather than a constant value for learning_rate. A one-cycle scheduler is provided as an example in one_cycle_lr_schedule.py; a minimal warmup sketch also appears after this list.
- To clip gradient norms as in the paper, add either clipnorm (parameter-wise clipping by norm) or global_clipnorm to the optimizer's keyword arguments; both options are sketched below.
- Decoupled weight decay, which applies the decay directly to the weights rather than folding it into the gradient, is used by default; see the last sketch below.
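
Below is a minimal sketch of passing a schedule object instead of a constant learning rate. The Apollo import path and class name are assumptions (adjust them to this repo's actual module); rather than guessing the signature of the one-cycle class in one_cycle_lr_schedule.py, the sketch uses a simple linear-warmup schedule built on the standard tf.keras LearningRateSchedule API.

```python
import tensorflow as tf

from apollo import Apollo  # assumed import path for this repo's optimizer


class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linearly ramps the learning rate from 0 to peak_lr over warmup_steps."""

    def __init__(self, peak_lr, warmup_steps):
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # step is the current training step as a scalar tensor.
        frac = tf.cast(step, tf.float32) / float(self.warmup_steps)
        return self.peak_lr * tf.minimum(frac, 1.0)

    def get_config(self):
        return {"peak_lr": self.peak_lr, "warmup_steps": self.warmup_steps}


# Pass the schedule object, not a float, as learning_rate.
optimizer = Apollo(learning_rate=LinearWarmup(peak_lr=0.01, warmup_steps=500))
```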
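
A sketch of the two clipping options. clipnorm and global_clipnorm are standard keyword arguments on tf.keras optimizers (TF 2.4+); this assumes Apollo subclasses tf.keras.optimizers.Optimizer and therefore inherits them.

```python
# Reusing the assumed Apollo class and the LinearWarmup schedule from above.
schedule = LinearWarmup(peak_lr=0.01, warmup_steps=500)

# Clip each variable's gradient independently to norm <= 1.0:
optimizer = Apollo(learning_rate=schedule, clipnorm=1.0)

# Or clip the global norm computed over all gradients jointly:
optimizer = Apollo(learning_rate=schedule, global_clipnorm=1.0)
```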
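
For reference, decoupled weight decay shrinks the weights directly instead of adding a penalty term to the gradient. The weight_decay argument name below is an assumption, not confirmed from this repo's constructor; check the signature in the source.

```python
# Coupled (plain L2):  wd * w is added to the gradient before the update.
# Decoupled:           wd * w is subtracted from the weights directly,
#                      independent of the gradient-based update.
optimizer = Apollo(
    learning_rate=LinearWarmup(peak_lr=0.01, warmup_steps=500),
    weight_decay=1e-4,  # decay strength; argument name is assumed
)
```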