PyTorch RL Minimal Implementations
This repository provides minimal implementations of several reinforcement learning algorithms, with the following characteristics:
- Minimal dependencies: only PyTorch (for building neural networks) and Gym (for testing the algorithms' performance) need to be installed.
- Independent implementations: each RL algorithm is implemented in a separate file, which makes it easy to follow its workflow and to modify it for other tasks.
- Flexible configuration: parameters and tricks such as reward normalization, advantage normalization, TensorBoard, and tqdm are easy to switch on and off (see the sketch right after this list).
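For instance, reward normalization and advantage normalization both come down to the same zero-mean, unit-variance rescaling. A self-contained sketch of the trick (illustrative only, not this repo's exact code):

```python
import torch

def normalize(x, eps=1e-8):
    # Zero-mean, unit-variance rescaling; the same trick serves for both
    # reward normalization and advantage normalization.
    x = torch.as_tensor(x, dtype=torch.float32)
    return (x - x.mean()) / (x.std() + eps)

advantages = normalize([1.0, -0.5, 2.0, 0.3])
```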
RL Algorithms List
Name | Type | Estimator | Paper | File |
---|---|---|---|---|
Q-Learning | Value-based / Off policy | TD | Watkins & Dayan. Q-learning. Machine Learning, 1992. | q_learning.py |
REINFORCE | Policy-based / On policy | MC | Sutton et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NeurIPS, 2000. | reinforce.py |
DQN | Value-based / Off policy | TD | Mnih et al. Human-level control through deep reinforcement learning. Nature, 2015. | in progress |
A2C | Actor-Critic / On policy | n-step TD | Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. In ICML, 2016. | a2c.py |
A3C | Actor-Critic / On policy | n-step TD | Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. In ICML, 2016. | a3c.py |
ACER | Actor-Critic / Off policy | GAE | Wang et al. Sample Efficient Actor-Critic with Experience Replay. In ICLR, 2017. | in progress |
ACKTR | Actor-Critic / On policy | GAE | Wu et al. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In NeurIPS, 2017. | in progress |
PPO | Actor-Critic / On policy | GAE | Schulman et al. Proximal Policy Optimization Algorithms. arXiv, 2017. | ppo.py |
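The Estimator column indicates how returns and advantages are computed: one-step TD, n-step TD, Monte Carlo (MC), or Generalized Advantage Estimation (GAE). As a concrete reference for the GAE rows, here is a minimal, self-contained computation (function and argument names are my own, not this repo's API):

```python
import torch

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment."""
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # zero out the bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# Example: a 3-step segment ending in a terminal state.
adv = compute_gae(rewards=[1.0, 0.0, 1.0],
                  values=[0.5, 0.4, 0.6],
                  last_value=0.0,
                  dones=[0.0, 0.0, 1.0])
```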
Quick Start
Requirements
- pytorch
- gym
- tensorboard (for the summary writer)
- tqdm (for the progress bar)
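All of them can be installed via pip; note that PyTorch is published on PyPI as `torch`, e.g. `pip install torch gym tensorboard tqdm`.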
Abstract Agent
Components / Parameters
Component | Description |
---|---|
policy | neural network model |
gamma | discount factor of cumulative reward |
lr | learning rate, split into lr_actor and lr_critic for actor-critic methods |
lr_decay | decay factor used to schedule the learning rate |
lr_scheduler | scheduler for the learning rate |
coef_critic_loss | coefficient of critic loss |
coef_entropy_loss | coefficient of entropy loss |
writer | summary writer to record information |
buffer | replay buffer to store historical trajectories |
use_cuda | whether to run on GPU |
clip_grad | whether to clip gradients |
max_grad_norm | maximum norm for gradient clipping |
norm_advantage | whether to normalize advantages |
open_tb | whether to enable the TensorBoard summary writer |
open_tqdm | whether to enable the tqdm progress bar |
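As a rough illustration of how coef_critic_loss, coef_entropy_loss, clip_grad, and max_grad_norm typically interact in an actor-critic update step (a sketch of the common pattern, not necessarily this repo's exact code):

```python
import torch.nn as nn

def total_loss(actor_loss, critic_loss, entropy,
               coef_critic_loss=0.5, coef_entropy_loss=0.01):
    # Weighted sum of the three loss terms; the entropy bonus is
    # subtracted to encourage exploration.
    return actor_loss + coef_critic_loss * critic_loss - coef_entropy_loss * entropy

def optimize_step(optimizer, policy, loss, clip_grad=True, max_grad_norm=0.5):
    optimizer.zero_grad()
    loss.backward()
    if clip_grad:
        # Rescale gradients so their global norm stays below max_grad_norm.
        nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
```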
Methods
Methods | Description |
---|---|
preprocess_obs() | preprocess the observation before feeding it to the neural network |
select_action() | use the actor network to sample an action from the policy distribution |
estimate_obs() | use the critic network to estimate the value of an observation |
update() | update the parameters by computing losses and gradients |
train() | set the neural network to train mode |
eval() | set the neural network to evaluation mode |
save() | save the model parameters |
load() | load the model parameters |
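To make the interface concrete, here is a self-contained, REINFORCE-style agent that fills in a subset of these methods (a sketch under my own naming, not the repo's actual base class):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class MinimalAgent:
    def __init__(self, obs_dim, act_dim, gamma=0.99, lr=1e-3, use_cuda=False):
        self.device = torch.device('cuda' if use_cuda else 'cpu')
        self.gamma = gamma
        self.policy = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        ).to(self.device)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)

    def preprocess_obs(self, obs):
        return torch.as_tensor(obs, dtype=torch.float32, device=self.device)

    def select_action(self, obs):
        # Sample from the categorical policy distribution.
        dist = Categorical(logits=self.policy(self.preprocess_obs(obs)))
        action = dist.sample()
        return action.item(), dist.log_prob(action)

    def update(self, log_probs, rewards):
        # Monte Carlo returns, then the REINFORCE policy-gradient loss.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + self.gamma * g
            returns.insert(0, g)
        returns = torch.as_tensor(returns, device=self.device)
        loss = -(torch.stack(log_probs) * returns).sum()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def train(self):
        self.policy.train()

    def eval(self):
        self.policy.eval()

    def save(self, path):
        torch.save(self.policy.state_dict(), path)

    def load(self, path):
        self.policy.load_state_dict(torch.load(path))
```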
Update & To-do & Limitations
Update History
- 2021-12-09 ADD TRICK: norm_critic_loss in PPO
- 2021-12-09 ADD PARAM: coef_critic_loss, coef_entropy_loss, log_step
- 2021-12-07 ADD ALGO: A3C
- 2021-12-05 ADD ALGO: PPO
- 2021-11-28 ADD ALGO: A2C
- 2021-11-20 ADD ALGO: Q-Learning, REINFORCE
To-do List
- ADD ALGO: DQN, Double DQN, Dueling DQN, DDPG
- ADD NN: RNN mode
Current Limitations
- Unsupported: vectorized environments
- Unsupported: continuous action spaces
- Unsupported: RNN-based models
- Unsupported: imitation learning