The Unsupervised Reinforcement Learning Benchmark (URLB)
URLB provides a set of leading algorithms for unsupervised reinforcement learning, where agents are first pre-trained without access to extrinsic rewards and then fine-tuned on downstream tasks.
We assume you have access to a GPU that can run CUDA 10.2 and CUDNN 8. Then, the simplest way to install all required dependencies is to create an anaconda environment by running
conda env create -f conda_env.yml
After the installation ends you can activate your environment with
conda activate urlb
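Before launching any training, you can sanity-check the GPU setup from inside the activated environment. This is an optional, illustrative check that assumes the environment installs PyTorch (which the URLB agents build on):

import torch

print(torch.__version__)                # installed PyTorch build
print(torch.cuda.is_available())        # should print True on a working CUDA setup
print(torch.backends.cudnn.version())   # should report a cuDNN 8.x build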
We support the following domains: walker, quadruped, and jaco.
Domain observation mode
Each domain supports two observation modes, states and pixels, selected with the obs_type flag (obs_type=states or obs_type=pixels).
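The domains are built on the DeepMind Control Suite. As a rough illustration of the difference between the two modes (using dm_control directly, not URLB's own wrappers; the 84x84 resolution is illustrative), states are low-dimensional proprioceptive vectors while pixels are rendered camera frames:

from dm_control import suite

env = suite.load(domain_name="walker", task_name="run")
timestep = env.reset()

# states mode: a dict of low-dimensional proprioceptive arrays
print({name: obs.shape for name, obs in timestep.observation.items()})

# pixels mode: RGB frames rendered from the simulator camera
frame = env.physics.render(height=84, width=84, camera_id=0)
print(frame.shape)  # (84, 84, 3)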
To run pre-training, use the pretrain.py script:
python pretrain.py agent=icm domain=walker
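ICM derives its intrinsic reward from the prediction error of a learned forward dynamics model: transitions the model cannot yet predict are treated as novel and rewarded. A minimal sketch of that idea (illustrative module, not URLB's exact implementation, which also trains an inverse model over learned features):

import torch
import torch.nn as nn

class ForwardModelCuriosity(nn.Module):
    """Illustrative ICM-style reward: error of a forward model f(s, a) ~ s'."""

    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.forward_model = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim),
        )

    def intrinsic_reward(self, obs, action, next_obs):
        predicted_next = self.forward_model(torch.cat([obs, action], dim=-1))
        # Higher prediction error => more novel transition => higher reward.
        return (predicted_next - next_obs).pow(2).mean(dim=-1, keepdim=True)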
Alternatively, to train a skill-based agent such as DIAYN, run:
python pretrain.py agent=diayn domain=walker
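DIAYN instead learns a set of skills: a discriminator is trained to recognize which skill z produced a given state, and the agent is rewarded for visiting states that make its current skill easy to identify, roughly log q(z|s) - log p(z). A hedged sketch of that reward (names illustrative, not URLB's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillDiscriminatorReward(nn.Module):
    """Illustrative DIAYN-style reward: log q(z|s) - log p(z), uniform prior."""

    def __init__(self, obs_dim, num_skills, hidden_dim=256):
        super().__init__()
        self.num_skills = num_skills
        self.discriminator = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_skills),
        )

    def intrinsic_reward(self, obs, skill_idx):
        log_q = F.log_softmax(self.discriminator(obs), dim=-1)
        log_q_z = log_q.gather(1, skill_idx.unsqueeze(-1))
        log_p_z = -torch.log(torch.tensor(float(self.num_skills)))  # uniform p(z)
        return log_q_z - log_p_z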
The pretrain.py script trains for 2M frames and saves several agent snapshots along the way. The snapshots are written to disk and are what the fine-tuning step below loads.
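If you want to inspect a saved snapshot by hand, the snapshot_ts values used below suggest snapshots are keyed by training step; assuming they are standard PyTorch payloads, a quick look could be (path illustrative, substitute whatever pretrain.py wrote for your run):

import torch

# Illustrative path; use the file pretrain.py actually produced.
snapshot = torch.load("path/to/snapshot_1000000.pt", map_location="cpu")
print(type(snapshot))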
Once you have pre-trained your method, you can use the saved snapshots to initialize the DDPG agent and fine-tune it on a downstream task. For example, if you have pre-trained ICM, you can fine-tune it on walker_run by running the following command:
python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=1000000 obs_type=states
This will load the snapshot selected by snapshot_ts, initialize DDPG with it (both the actor and critic), and start training on walker_run using the extrinsic reward of the task.
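Conceptually, the hand-off is a weight copy: the pre-trained networks seed a fresh DDPG learner, which is then trained on the task reward. A minimal sketch, assuming agents expose actor and critic modules (attribute names are illustrative, not URLB's exact API):

def init_from_pretrained(ddpg_agent, pretrained_agent):
    """Seed a DDPG agent with pre-trained weights before task fine-tuning."""
    ddpg_agent.actor.load_state_dict(pretrained_agent.actor.state_dict())
    ddpg_agent.critic.load_state_dict(pretrained_agent.critic.state_dict())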
For methods that use skills, also pass the agent explicitly and set the reward_free flag to false:
python finetune.py pretrained_agent=smm task=walker_run snapshot_ts=1000000 obs_type=states agent=smm reward_free=false
Logs are stored in the
exp_local folder. To launch tensorboard run:
tensorboard --logdir exp_local
The console output is also available in the following form:
| train | F: 6000 | S: 3000 | E: 6 | L: 1000 | R: 5.5177 | FPS: 96.7586 | T: 0:00:42
A training entry decodes as:
F  : total number of environment frames
S  : total number of agent steps
E  : total number of episodes
L  : episode length (steps)
R  : episode return
FPS: training throughput (frames per second)
T  : total training time
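If you want to consume these console lines programmatically, a small parser written against the format shown above could look like this (illustrative helper, not part of the repo):

def parse_train_line(line):
    """Turn '| train | F: 6000 | ... | T: 0:00:42' into a dict of fields."""
    fields = [field.strip() for field in line.strip().strip("|").split("|")]
    entry = {"mode": fields[0]}
    for field in fields[1:]:
        key, value = field.split(":", 1)  # split on the first colon only ('T: 0:00:42')
        entry[key.strip()] = value.strip()
    return entry

print(parse_train_line(
    "| train | F: 6000 | S: 3000 | E: 6 | L: 1000 | R: 5.5177 | FPS: 96.7586 | T: 0:00:42"
))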