In D4RL and RL Unplugged: Benchmarks for Offline Reinforcement Learning, we released a suite of benchmarks for offline reinforcement learning. The datasets are designed for ease of use: they share a unified API, so a practitioner can work with all data in the suite once a general pipeline has been established.

Here, we release policies which can be used in conjunction with the RL Unplugged datasets to facilitate off-policy evaluation and offline model selection benchmarking.

In this release, we provide:

  • Policies for the tasks in the DeepMind Locomotion and Control Suite datasets (described below).

  • Policies trained with the following algorithms: D4PG, ABM, CRR and BC, together with snapshots taken along the training trajectory. This facilitates benchmarking offline model selection.

The policies are available under gs://offline-rl/evaluation, with the D4RL policies provided in the subdirectory gs://offline-rl/evaluation/d4rl.

Task Descriptions

DeepMind Locomotion Dataset

These tasks are made up of the corridor locomotion tasks involving the CMU
Humanoid, for which prior efforts have either used motion capture data
([Merel et al., 2019a], [Merel et al., 2019b]) or training from scratch
([Song et al., 2020]). In addition, the DM Locomotion repository contains a set of
tasks adapted to a virtual rodent ([Merel et al., 2020]). We
emphasize that the DM Locomotion tasks combine challenging
high-DoF continuous control with perception from rich egocentric
observations. For details on how the dataset was generated, please refer to
RL Unplugged: Benchmarks for Offline Reinforcement Learning.

DeepMind Control Suite Dataset

DeepMind Control Suite ([Tassa et al., 2018]) is a set of control tasks
implemented in MuJoCo ([Todorov et al., 2012]). We consider a subset of the tasks
provided in the suite that cover a wide range of difficulties.

Most of the datasets in this domain are generated using D4PG. For the
environments Manipulator insert ball and Manipulator insert peg we use V-MPO
([Song et al., 2020]) to generate the data as D4PG is unable to solve these tasks.
We release datasets for 9 control suite tasks. For details on how the dataset
was generated, please refer to
RL Unplugged: Benchmarks for Offline Reinforcement Learning.

Using the policies

The policies.json file provides metadata about the policies in this dataset.
It is structured as a list of dictionaries, one for each policy, where the keys
contain metadata including:

  • policy_path: The path to the policy on Google Cloud Storage.

  • task.task_name: The task that the policy is trained for.

  • agent_name: The training algorithm used to learn the policy.

  • snapshot_name: Contains the learning step for this policy snapshot.

  • return_mean: The mean return estimated with Monte Carlo rollouts.

  • return_sem: The standard error of the mean estimate.
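As a rough illustration of working with this metadata, the sketch below reads a list of policy entries and filters them by task and training algorithm. It makes several assumptions that this README does not confirm: that policies.json has been copied to a local path, and that a dotted name such as task.task_name may be either a literal key or a nested dictionary (the helper tries both). The helper names and the sample entries are hypothetical.

```python
import json


def load_policies(path):
    """Read the metadata file (assumed to be a local copy of policies.json)."""
    with open(path) as f:
        return json.load(f)


def get_field(entry, dotted_key):
    """Look up a key such as "task.task_name".

    Tries the literal key first, then walks nested dictionaries, because the
    layout of dotted names is an assumption here.
    """
    if dotted_key in entry:
        return entry[dotted_key]
    value = entry
    for part in dotted_key.split("."):
        value = value[part]
    return value


def select_policies(policies, task_name=None, agent_name=None):
    """Return the entries matching the given task and/or training algorithm."""
    selected = []
    for entry in policies:
        if task_name is not None and get_field(entry, "task.task_name") != task_name:
            continue
        if agent_name is not None and get_field(entry, "agent_name") != agent_name:
            continue
        selected.append(entry)
    return selected


# Hypothetical entries for illustration only; real values come from policies.json.
SAMPLE = json.loads("""
[
  {"policy_path": "path/to/policy_a", "task": {"task_name": "task_x"},
   "agent_name": "D4PG", "snapshot_name": "snapshot_1",
   "return_mean": 100.0, "return_sem": 2.0},
  {"policy_path": "path/to/policy_b", "task.task_name": "task_x",
   "agent_name": "BC", "snapshot_name": "snapshot_1",
   "return_mean": 60.0, "return_sem": 3.0},
  {"policy_path": "path/to/policy_c", "task": {"task_name": "task_y"},
   "agent_name": "D4PG", "snapshot_name": "snapshot_2",
   "return_mean": 80.0, "return_sem": 1.5}
]
""")

for entry in select_policies(SAMPLE, task_name="task_x"):
    print(entry["policy_path"], entry["agent_name"], entry["return_mean"])
```

Filtering on the task first and then comparing return_mean across agents and snapshots is one way to set up an offline model selection experiment.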


  • Install dependencies: pip install -r requirements.txt
  • (Optional) Set up a MuJoCo license key for DM Control environments

Policy loading example

Policies are stored as
TensorFlow SavedModels. Calling
the policy on an observation returns an action sample. See for an example of loading a policy.
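A minimal sketch of loading a policy and rolling it out might look like the following. It assumes TensorFlow 2 is installed, that the SavedModel can be loaded with tf.saved_model.load (reading a gs:// path additionally assumes a TensorFlow build with Google Cloud Storage support), and that the environment follows the dm_env interface used by DM Control (reset() and step() returning a TimeStep). The function names and the discount parameter are hypothetical, and any conversion between TensorFlow tensors and the arrays the environment expects (batch dimension, dtype) is left out because this README does not specify it.

```python
def load_policy(policy_path):
    """Load a policy stored as a TensorFlow SavedModel."""
    # Imported lazily so the rest of this sketch can be used without TensorFlow.
    import tensorflow as tf

    return tf.saved_model.load(policy_path)


def run_episode(policy, env, discount=1.0):
    """Roll out one episode and return the (optionally discounted) return.

    `policy` is any callable mapping an observation to an action sample.
    `env` is assumed to follow the dm_env API.
    """
    time_step = env.reset()
    total, weight = 0.0, 1.0
    while not time_step.last():
        # Calling the policy on an observation returns an action sample.
        action = policy(time_step.observation)
        time_step = env.step(action)
        total += weight * float(time_step.reward)
        weight *= discount
    return total


# Example usage (not executed here; the path comes from policies.json):
# policy = load_policy(entry["policy_path"])
# episode_return = run_episode(policy, env)
```

Repeating run_episode over many episodes gives the Monte Carlo samples from which a mean return and its standard error can be estimated.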

Compute evaluation metrics

TODO Fill in example computing groundtruth and evaluation metrics.

Dataset Metadata

The following table is necessary for this dataset to be indexed by search
engines such as Google Dataset Search.

property     value
name         Benchmarks for Deep Off-Policy Evaluation
description  Data accompanying [Benchmarks for Deep Off-Policy Evaluation]().

property     value
name         Google