052361de
Commit
052361de
authored
Dec 05, 2018
by
ofirnachum
Browse files
add training code
parent
9b969ca5
Changes: 51 files in this commit; showing 20 changed files with 2968 additions and 8 deletions.

research/efficient-hrl/README.md (+42, -8)
research/efficient-hrl/agent.py (+774, -0)
research/efficient-hrl/agents/__init__.py (+1, -0)
research/efficient-hrl/agents/circular_buffer.py (+289, -0)
research/efficient-hrl/agents/ddpg_agent.py (+739, -0)
research/efficient-hrl/agents/ddpg_networks.py (+150, -0)
research/efficient-hrl/cond_fn.py (+244, -0)
research/efficient-hrl/configs/base_uvf.gin (+68, -0)
research/efficient-hrl/configs/eval_uvf.gin (+14, -0)
research/efficient-hrl/configs/train_uvf.gin (+52, -0)
research/efficient-hrl/context/__init__.py (+1, -0)
research/efficient-hrl/context/configs/ant_block.gin (+67, -0)
research/efficient-hrl/context/configs/ant_block_maze.gin (+67, -0)
research/efficient-hrl/context/configs/ant_fall_multi.gin (+62, -0)
research/efficient-hrl/context/configs/ant_fall_multi_img.gin (+68, -0)
research/efficient-hrl/context/configs/ant_fall_single.gin (+62, -0)
research/efficient-hrl/context/configs/ant_maze.gin (+66, -0)
research/efficient-hrl/context/configs/ant_maze_img.gin (+72, -0)
research/efficient-hrl/context/configs/ant_push_multi.gin (+62, -0)
research/efficient-hrl/context/configs/ant_push_multi_img.gin (+68, -0)
research/efficient-hrl/README.md

Code for performing Hierarchical RL based on the following publications:

"Data-Efficient Hierarchical Reinforcement Learning" by
Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
(https://arxiv.org/abs/1805.08296).

"Near-Optimal Representation Learning for Hierarchical Reinforcement Learning"
by Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
(https://arxiv.org/abs/1810.01257).

Requirements:
* TensorFlow (see http://www.tensorflow.org for how to install/upgrade)
* Gin Config (see https://github.com/google/gin-config)
* Tensorflow Agents (see https://github.com/tensorflow/agents)
* OpenAI Gym (see http://gym.openai.com/docs, be sure to install MuJoCo as well)
* NumPy (see http://www.numpy.org/)

Quick Start:

Run a training job based on the original HIRO paper on Ant Maze:

```
python scripts/local_train.py test1 hiro_orig ant_maze base_uvf suite
```

Run a continuous evaluation job for that experiment:

```
python scripts/local_eval.py test1 hiro_orig ant_maze base_uvf suite
```

To run the same experiment with online representation learning (the
"Near-Optimal" paper), change `hiro_orig` to `hiro_repr`.
You can also run with `hiro_xy` to run the same experiment with HIRO on only the
xy coordinates of the agent.

To run on other environments, change `ant_maze` to something else; e.g.,
`ant_push_multi`, `ant_fall_multi`, etc. See `context/configs/*` for other
options.

Basic Code Guide:

The code for training resides in train.py. The code trains a lower-level policy
(a UVF agent in the code) and a higher-level policy (a MetaAgent in the code)
concurrently. The higher-level policy communicates goals to the lower-level
policy. In the code, this is called a context. Not only does the lower-level
policy act with respect to a context (a higher-level specified goal), but the
higher-level policy also acts with respect to an environment-specified context
(corresponding to the navigation target location associated with the task).
Therefore, in `context/configs/*` you will find both specifications for task
setup as well as goal configurations. Most remaining hyperparameters used for
training/evaluation may be found in `configs/*`.

NOTE: Not all the code corresponding to the "Near-Optimal" paper is included.
Namely, changes to low-level policy training proposed in the paper (discounting
and auxiliary rewards) are not implemented here. Performance should not change
significantly.

Maintained by Ofir Nachum (ofirnachum).
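As a rough illustration of the goal-context idea described in the Basic Code Guide (not code from this commit): in the HIRO setup the lower-level policy is rewarded for moving selected state coordinates toward the goal supplied by the higher-level policy, which the configs below express as a `negative_distance` reward over `state_indices`. A minimal NumPy sketch with hypothetical names:

```
import numpy as np

def negative_distance_reward(state, goal, state_indices=(0, 1)):
  """Illustrative lower-level reward: negative L2 distance between the
  selected state coordinates and the goal set by the higher-level policy."""
  achieved = np.asarray(state, dtype=np.float64)[list(state_indices)]
  return -np.linalg.norm(achieved - np.asarray(goal, dtype=np.float64))

# Example: the meta-policy asked the agent to reach xy = (3.0, 4.0).
print(negative_distance_reward(np.zeros(30), goal=(3.0, 4.0)))  # -5.0
```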
research/efficient-hrl/agent.py (new file, mode 100644; diff collapsed, contents not shown)
research/efficient-hrl/agents/__init__.py (new file, mode 100644)
research/efficient-hrl/agents/circular_buffer.py (new file, mode 100644)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A circular buffer where each element is a list of tensors.
Each element of the buffer is a list of tensors. An example use case is a replay
buffer in reinforcement learning, where each element is a list of tensors
representing the state, action, reward etc.
New elements are added sequentially, and once the buffer is full, we
start overwriting them in a circular fashion. Reading does not remove any
elements, only adding new elements does.
"""
import collections
import numpy as np
import tensorflow as tf

import gin.tf


@gin.configurable
class CircularBuffer(object):
  """A circular buffer where each element is a list of tensors."""

  def __init__(self, buffer_size=1000, scope='replay_buffer'):
    """Circular buffer of list of tensors.

    Args:
      buffer_size: (integer) maximum number of tensor lists the buffer can hold.
      scope: (string) variable scope for creating the variables.
    """
    self._buffer_size = np.int64(buffer_size)
    self._scope = scope
    self._tensors = collections.OrderedDict()
    with tf.variable_scope(self._scope):
      self._num_adds = tf.Variable(0, dtype=tf.int64, name='num_adds')
      self._num_adds_cs = tf.contrib.framework.CriticalSection(name='num_adds')

  @property
  def buffer_size(self):
    return self._buffer_size

  @property
  def scope(self):
    return self._scope

  @property
  def num_adds(self):
    return self._num_adds

  def _create_variables(self, tensors):
    with tf.variable_scope(self._scope):
      for name in tensors.keys():
        tensor = tensors[name]
        self._tensors[name] = tf.get_variable(
            name='BufferVariable_' + name,
            shape=[self._buffer_size] + tensor.get_shape().as_list(),
            dtype=tensor.dtype,
            trainable=False)

  def _validate(self, tensors):
    """Validate shapes of tensors."""
    if len(tensors) != len(self._tensors):
      raise ValueError('Expected tensors to have %d elements. Received %d '
                       'instead.' % (len(self._tensors), len(tensors)))
    if self._tensors.keys() != tensors.keys():
      raise ValueError('The keys of tensors should always be the same. '
                       'Received %s instead of %s.' %
                       (tensors.keys(), self._tensors.keys()))
    for name, tensor in tensors.items():
      if tensor.get_shape().as_list() != self._tensors[
          name].get_shape().as_list()[1:]:
        raise ValueError('Tensor %s has incorrect shape.' % name)
      if not tensor.dtype.is_compatible_with(self._tensors[name].dtype):
        raise ValueError(
            'Tensor %s has incorrect data type. Expected %s, received %s' %
            (name, self._tensors[name].read_value().dtype, tensor.dtype))

  def add(self, tensors):
    """Adds an element (list/tuple/dict of tensors) to the buffer.

    Args:
      tensors: (list/tuple/dict of tensors) to be added to the buffer.
    Returns:
      An add operation that adds the input `tensors` to the buffer. Similar to
        an enqueue_op.
    Raises:
      ValueError: If the shapes and data types of input `tensors` are not the
        same across calls to the add function.
    """
    return self.maybe_add(tensors, True)

  def maybe_add(self, tensors, condition):
    """Adds an element (tensors) to the buffer based on the condition.

    Args:
      tensors: (list/tuple of tensors) to be added to the buffer.
      condition: A boolean Tensor controlling whether the tensors would be added
        to the buffer or not.
    Returns:
      An add operation that adds the input `tensors` to the buffer. Similar to
        a maybe_enqueue_op.
    Raises:
      ValueError: If the shapes and data types of input `tensors` are not the
        same across calls to the add function.
    """
    if not isinstance(tensors, dict):
      names = [str(i) for i in range(len(tensors))]
      tensors = collections.OrderedDict(zip(names, tensors))
    if not isinstance(tensors, collections.OrderedDict):
      tensors = collections.OrderedDict(
          sorted(tensors.items(), key=lambda t: t[0]))
    if not self._tensors:
      self._create_variables(tensors)
    else:
      self._validate(tensors)

    #@tf.critical_section(self._position_mutex)
    def _increment_num_adds():
      # Adding 0 to the num_adds variable is a trick to read the value of the
      # variable and return a read-only tensor. Doing this in a critical
      # section allows us to capture a snapshot of the variable that will
      # not be affected by other threads updating num_adds.
      return self._num_adds.assign_add(1) + 0

    def _add():
      num_adds_inc = self._num_adds_cs.execute(_increment_num_adds)
      current_pos = tf.mod(num_adds_inc - 1, self._buffer_size)
      update_ops = []
      for name in self._tensors.keys():
        update_ops.append(
            tf.scatter_update(self._tensors[name], current_pos, tensors[name]))
      return tf.group(*update_ops)

    return tf.contrib.framework.smart_cond(condition, _add, tf.no_op)

  def get_random_batch(self, batch_size, keys=None, num_steps=1):
    """Samples a batch of tensors from the buffer with replacement.

    Args:
      batch_size: (integer) number of elements to sample.
      keys: List of keys of tensors to retrieve. If None retrieve all.
      num_steps: (integer) length of trajectories to return. If > 1 will return
        a list of lists, where each internal list represents a trajectory of
        length num_steps.
    Returns:
      A list of tensors, where each element in the list is a batch sampled from
        one of the tensors in the buffer.
    Raises:
      ValueError: If get_random_batch is called before calling the add function.
      tf.errors.InvalidArgumentError: If this operation is executed before any
        items are added to the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before '
                       'get_random_batch.')
    if keys is None:
      keys = self._tensors.keys()

    latest_start_index = self.get_num_adds() - num_steps + 1
    empty_buffer_assert = tf.Assert(
        tf.greater(latest_start_index, 0),
        ['Not enough elements have been added to the buffer.'])
    with tf.control_dependencies([empty_buffer_assert]):
      max_index = tf.minimum(self._buffer_size, latest_start_index)
      indices = tf.random_uniform(
          [batch_size],
          minval=0,
          maxval=max_index,
          dtype=tf.int64)
      if num_steps == 1:
        return self.gather(indices, keys)
      else:
        return self.gather_nstep(num_steps, indices, keys)

  def gather(self, indices, keys=None):
    """Returns elements at the specified indices from the buffer.

    Args:
      indices: (list of integers or rank 1 int Tensor) indices in the buffer to
        retrieve elements from.
      keys: List of keys of tensors to retrieve. If None retrieve all.
    Returns:
      A list of tensors, where each element in the list is obtained by indexing
        one of the tensors in the buffer.
    Raises:
      ValueError: If gather is called before calling the add function.
      tf.errors.InvalidArgumentError: If indices are bigger than the number of
        items in the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before calling gather.')
    if keys is None:
      keys = self._tensors.keys()
    with tf.name_scope('Gather'):
      index_bound_assert = tf.Assert(
          tf.less(
              tf.to_int64(tf.reduce_max(indices)),
              tf.minimum(self.get_num_adds(), self._buffer_size)),
          ['Index out of bounds.'])
      with tf.control_dependencies([index_bound_assert]):
        indices = tf.convert_to_tensor(indices)

      batch = []
      for key in keys:
        batch.append(tf.gather(self._tensors[key], indices, name=key))
      return batch

  def gather_nstep(self, num_steps, indices, keys=None):
    """Returns elements at the specified indices from the buffer.

    Args:
      num_steps: (integer) length of trajectories to return.
      indices: (list of rank num_steps int Tensor) indices in the buffer to
        retrieve elements from for multiple trajectories. Each Tensor in the
        list represents the indices for a trajectory.
      keys: List of keys of tensors to retrieve. If None retrieve all.
    Returns:
      A list of list-of-tensors, where each element in the list is obtained by
        indexing one of the tensors in the buffer.
    Raises:
      ValueError: If gather is called before calling the add function.
      tf.errors.InvalidArgumentError: If indices are bigger than the number of
        items in the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before calling gather.')
    if keys is None:
      keys = self._tensors.keys()
    with tf.name_scope('Gather'):
      index_bound_assert = tf.Assert(
          tf.less_equal(
              tf.to_int64(tf.reduce_max(indices) + num_steps),
              self.get_num_adds()),
          ['Trajectory indices go out of bounds.'])
      with tf.control_dependencies([index_bound_assert]):
        indices = tf.map_fn(
            lambda x: tf.mod(tf.range(x, x + num_steps), self._buffer_size),
            indices,
            dtype=tf.int64)

      batch = []
      for key in keys:

        def SampleTrajectories(trajectory_indices, key=key,
                               num_steps=num_steps):
          trajectory_indices.set_shape([num_steps])
          return tf.gather(self._tensors[key], trajectory_indices, name=key)

        batch.append(tf.map_fn(SampleTrajectories, indices,
                               dtype=self._tensors[key].dtype))
      return batch

  def get_position(self):
    """Returns the position at which the last element was added.

    Returns:
      An int tensor representing the index at which the last element was added
        to the buffer or -1 if no elements were added.
    """
    return tf.cond(self.get_num_adds() < 1,
                   lambda: self.get_num_adds() - 1,
                   lambda: tf.mod(self.get_num_adds() - 1, self._buffer_size))

  def get_num_adds(self):
    """Returns the number of additions to the buffer.

    Returns:
      An int tensor representing the number of elements that were added.
    """
    def num_adds():
      return self._num_adds.value()

    return self._num_adds_cs.execute(num_adds)

  def get_num_tensors(self):
    """Returns the number of tensors (slots) in the buffer."""
    return len(self._tensors)
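For orientation, here is a minimal sketch of how the buffer above might be wired into a TF 1.x graph. The placeholder shapes, key names, and import path are assumptions for illustration, not part of this commit:

```
import numpy as np
import tensorflow as tf
from agents.circular_buffer import CircularBuffer  # path assumed

# Placeholders standing in for one transition; shapes are illustrative.
state = tf.placeholder(tf.float32, [30], name='state')
action = tf.placeholder(tf.float32, [8], name='action')
reward = tf.placeholder(tf.float32, [], name='reward')

buf = CircularBuffer(buffer_size=1000, scope='example_buffer')
# The first add() call creates the underlying buffer variables from the
# shapes/dtypes of the tensors it receives.
add_op = buf.add({'state': state, 'action': action, 'reward': reward})
# Sampling ops can be built once those variables exist.
sample_op = buf.get_random_batch(batch_size=64)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(200):
    sess.run(add_op, feed_dict={state: np.random.randn(30),
                                action: np.random.randn(8),
                                reward: 0.0})
  # Returns a list of arrays, one per key (sorted), each with leading dim 64.
  batch = sess.run(sample_op)
```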
research/efficient-hrl/agents/ddpg_agent.py (new file, mode 100644; diff collapsed, contents not shown)
research/efficient-hrl/agents/ddpg_networks.py (new file, mode 100644)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Sample actor(policy) and critic(q) networks to use with DDPG/NAF agents.
The DDPG networks are defined in "Section 7: Experiment Details" of
"Continuous control with deep reinforcement learning" - Lilicrap et al.
https://arxiv.org/abs/1509.02971
The NAF critic network is based on "Section 4" of "Continuous deep Q-learning
with model-based acceleration" - Gu et al. https://arxiv.org/pdf/1603.00748.
"""
import tensorflow as tf
slim = tf.contrib.slim

import gin.tf


@gin.configurable('ddpg_critic_net')
def critic_net(states, actions,
               for_critic_loss=False,
               num_reward_dims=1,
               states_hidden_layers=(400,),
               actions_hidden_layers=None,
               joint_hidden_layers=(300,),
               weight_decay=0.0001,
               normalizer_fn=None,
               activation_fn=tf.nn.relu,
               zero_obs=False,
               images=False):
  """Creates a critic that returns q values for the given states and actions.

  Args:
    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
      representing a batch of states.
    actions: (castable to tf.float32) a [batch_size, num_action_dims] tensor
      representing a batch of actions.
    num_reward_dims: Number of reward dimensions.
    states_hidden_layers: tuple of hidden layers units for states.
    actions_hidden_layers: tuple of hidden layers units for actions.
    joint_hidden_layers: tuple of hidden layers units after joining states
      and actions using tf.concat().
    weight_decay: Weight decay for l2 weights regularizer.
    normalizer_fn: Normalizer function, i.e. slim.layer_norm,
    activation_fn: Activation function, i.e. tf.nn.relu, slim.leaky_relu, ...
  Returns:
    A tf.float32 [batch_size] tensor of q values, or a tf.float32
      [batch_size, num_reward_dims] tensor of vector q values if
      num_reward_dims > 1.
  """
  with slim.arg_scope(
      [slim.fully_connected],
      activation_fn=activation_fn,
      normalizer_fn=normalizer_fn,
      weights_regularizer=slim.l2_regularizer(weight_decay),
      weights_initializer=slim.variance_scaling_initializer(
          factor=1.0/3.0, mode='FAN_IN', uniform=True)):

    orig_states = tf.to_float(states)
    #states = tf.to_float(states)
    states = tf.concat([tf.to_float(states), tf.to_float(actions)], -1)  #TD3
    if images or zero_obs:
      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))  #LALA
    actions = tf.to_float(actions)
    if states_hidden_layers:
      states = slim.stack(states, slim.fully_connected, states_hidden_layers,
                          scope='states')
    if actions_hidden_layers:
      actions = slim.stack(actions, slim.fully_connected,
                           actions_hidden_layers, scope='actions')
    joint = tf.concat([states, actions], 1)
    if joint_hidden_layers:
      joint = slim.stack(joint, slim.fully_connected, joint_hidden_layers,
                         scope='joint')
    with slim.arg_scope([slim.fully_connected],
                        weights_regularizer=None,
                        weights_initializer=tf.random_uniform_initializer(
                            minval=-0.003, maxval=0.003)):
      value = slim.fully_connected(joint, num_reward_dims,
                                   activation_fn=None,
                                   normalizer_fn=None,
                                   scope='q_value')
    if num_reward_dims == 1:
      value = tf.reshape(value, [-1])
    if not for_critic_loss and num_reward_dims > 1:
      value = tf.reduce_sum(
          value * tf.abs(orig_states[:, -num_reward_dims:]), -1)
  return value


@gin.configurable('ddpg_actor_net')
def actor_net(states, action_spec,
              hidden_layers=(400, 300),
              normalizer_fn=None,
              activation_fn=tf.nn.relu,
              zero_obs=False,
              images=False):
  """Creates an actor that returns actions for the given states.

  Args:
    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
      representing a batch of states.
    action_spec: (BoundedTensorSpec) A tensor spec indicating the shape
      and range of actions.
    hidden_layers: tuple of hidden layers units.
    normalizer_fn: Normalizer function, i.e. slim.layer_norm,
    activation_fn: Activation function, i.e. tf.nn.relu, slim.leaky_relu, ...
  Returns:
    A tf.float32 [batch_size, num_action_dims] tensor of actions.
  """
  with slim.arg_scope(
      [slim.fully_connected],
      activation_fn=activation_fn,
      normalizer_fn=normalizer_fn,
      weights_initializer=slim.variance_scaling_initializer(
          factor=1.0/3.0, mode='FAN_IN', uniform=True)):

    states = tf.to_float(states)
    orig_states = states
    if images or zero_obs:
      # Zero-out x, y position. Hacky.
      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))
    if hidden_layers:
      states = slim.stack(states, slim.fully_connected, hidden_layers,
                          scope='states')
    with slim.arg_scope([slim.fully_connected],
                        weights_initializer=tf.random_uniform_initializer(
                            minval=-0.003, maxval=0.003)):
      actions = slim.fully_connected(states,
                                     action_spec.shape.num_elements(),
                                     scope='actions',
                                     normalizer_fn=None,
                                     activation_fn=tf.nn.tanh)
      action_means = (action_spec.maximum + action_spec.minimum) / 2.0
      action_magnitudes = (action_spec.maximum - action_spec.minimum) / 2.0
      actions = action_means + action_magnitudes * actions

  return actions
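A small sketch of how these two networks might be instantiated. The namedtuple stand-in for `action_spec` and the tensor shapes are illustrative assumptions; in the repo the spec comes from the agent/environment code, which is collapsed in this diff:

```
import collections
import tensorflow as tf

# Stand-in exposing only the attributes actor_net reads from a BoundedTensorSpec.
ActionSpec = collections.namedtuple('ActionSpec', ['shape', 'minimum', 'maximum'])
spec = ActionSpec(shape=tf.TensorShape([8]), minimum=-1.0, maximum=1.0)

states = tf.placeholder(tf.float32, [None, 30])
with tf.variable_scope('actor'):
  actions = actor_net(states, spec)       # [batch, 8], squashed into [-1, 1]
with tf.variable_scope('critic'):
  q_values = critic_net(states, actions)  # [batch]
```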
research/efficient-hrl/cond_fn.py (new file, mode 100644)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines many boolean functions indicating when to step and reset.
"""
import tensorflow as tf
import gin.tf


@gin.configurable
def env_transition(agent, state, action, transition_type, environment_steps,
                   num_episodes):
  """True if the transition_type is TRANSITION or FINAL_TRANSITION.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: Returns an op that evaluates to true if the transition type is
      not RESTARTING.
  """
  del agent, state, action, num_episodes, environment_steps
  cond = tf.logical_not(transition_type)
  return cond


@gin.configurable
def env_restart(agent, state, action, transition_type, environment_steps,
                num_episodes):
  """True if the transition_type is RESTARTING.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: Returns an op that evaluates to true if the transition type equals
      RESTARTING.
  """
  del agent, state, action, num_episodes, environment_steps
  cond = tf.identity(transition_type)
  return cond


@gin.configurable
def every_n_steps(agent,
                  state,
                  action,
                  transition_type,
                  environment_steps,
                  num_episodes,
                  n=150):
  """True once every n steps.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    n: Return true once every n steps.
  Returns:
    cond: Returns an op that evaluates to true if environment_steps
      equals 0 mod n. We increment the step before checking this condition, so
      we do not need to add one to environment_steps.
  """
  del agent, state, action, transition_type, num_episodes
  cond = tf.equal(tf.mod(environment_steps, n), 0)
  return cond


@gin.configurable
def every_n_episodes(agent,
                     state,
                     action,
                     transition_type,
                     environment_steps,
                     num_episodes,
                     n=2,
                     steps_per_episode=None):
  """True once every n episodes.

  Specifically, evaluates to True on the 0th step of every nth episode.
  Unlike environment_steps, num_episodes starts at 0, so we do want to add
  one to ensure it does not reset on the first call.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    n: Return true once every n episodes.
    steps_per_episode: How many steps per episode. Needed to determine when a
      new episode starts.
  Returns:
    cond: Returns an op that evaluates to true on the last step of the episode
      (i.e. if num_episodes equals 0 mod n).
  """
  assert steps_per_episode is not None
  del agent, action, transition_type
  ant_fell = tf.logical_or(state[2] < 0.2, state[2] > 1.0)
  cond = tf.logical_and(
      tf.logical_or(
          ant_fell,
          tf.equal(tf.mod(num_episodes + 1, n), 0)),
      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
  return cond


@gin.configurable
def failed_reset_after_n_episodes(agent,
                                  state,
                                  action,
                                  transition_type,
                                  environment_steps,
                                  num_episodes,
                                  steps_per_episode=None,
                                  reset_state=None,
                                  max_dist=1.0,
                                  epsilon=1e-10):
  """Every n episodes, returns True if the reset agent fails to return.

  Specifically, evaluates to True if the distance between the state and the
  reset state is greater than max_dist at the end of the episode.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    steps_per_episode: How many steps per episode. Needed to determine when a
      new episode starts.
    reset_state: State to which the reset controller should return.
    max_dist: Agent is considered to have successfully reset if its distance
      from the reset_state is less than max_dist.
    epsilon: small offset to ensure non-negative/zero distance.
  Returns:
    cond: Returns an op that evaluates to true if num_episodes+1 equals 0
      mod n. We add one to the num_episodes so the environment is not reset
      after the 0th step.
  """
  assert steps_per_episode is not None
  assert reset_state is not None
  # Note: `state` is used below, so it must not be deleted here.
  del agent, action, transition_type, num_episodes
  dist = tf.sqrt(
      tf.reduce_sum(tf.squared_difference(state, reset_state)) + epsilon)
  cond = tf.logical_and(
      tf.greater(dist, tf.constant(max_dist)),
      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
  return cond


@gin.configurable
def q_too_small(agent,
                state,
                action,
                transition_type,
                environment_steps,
                num_episodes,
                q_min=0.5):
  """True if q is too small.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    q_min: Returns true if the qval is less than q_min.
  Returns:
    cond: Returns an op that evaluates to true if qval is less than q_min.
  """
  del transition_type, environment_steps, num_episodes
  state_for_reset_agent = tf.stack(state[:-1],
                                   tf.constant([0], dtype=tf.float32))
  qval = agent.BASE_AGENT_CLASS.critic_net(
      tf.expand_dims(state_for_reset_agent, 0),
      tf.expand_dims(action, 0))[0, :]
  cond = tf.greater(tf.constant(q_min), qval)
  return cond


@gin.configurable
def true_fn(agent, state, action, transition_type, environment_steps,
            num_episodes):
  """Returns an op that evaluates to true.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: op that always evaluates to True.
  """
  del agent, state, action, transition_type, environment_steps, num_episodes
  cond = tf.constant(True, dtype=tf.bool)
  return cond


@gin.configurable
def false_fn(agent, state, action, transition_type, environment_steps,
             num_episodes):
  """Returns an op that evaluates to false.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: op that always evaluates to False.
  """
  del agent, state, action, transition_type, environment_steps, num_episodes
  cond = tf.constant(False, dtype=tf.bool)
  return cond
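As a quick illustration of how one of these condition ops behaves (a hedged sketch, not from the commit): `every_n_steps` only looks at `environment_steps`, so the unused arguments can be anything:

```
import tensorflow as tf

environment_steps = tf.placeholder(tf.int64, [])
cond = every_n_steps(agent=None, state=None, action=None, transition_type=None,
                     environment_steps=environment_steps, num_episodes=None,
                     n=150)

with tf.Session() as sess:
  print(sess.run(cond, {environment_steps: 300}))  # True  (300 % 150 == 0)
  print(sess.run(cond, {environment_steps: 301}))  # False
```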
research/efficient-hrl/configs/base_uvf.gin (new file, mode 100644)
#-*-Python-*-
import gin.tf.external_configurables
create_maze_env.top_down_view = %IMAGES
## Create the agent
AGENT_CLASS = @UvfAgent
UvfAgent.tf_context = %CONTEXT
UvfAgent.actor_net = @agent/ddpg_actor_net
UvfAgent.critic_net = @agent/ddpg_critic_net
UvfAgent.dqda_clipping = 0.0
UvfAgent.td_errors_loss = @tf.losses.huber_loss
UvfAgent.target_q_clipping = %TARGET_Q_CLIPPING
# Create meta agent
META_CLASS = @MetaAgent
MetaAgent.tf_context = %META_CONTEXT
MetaAgent.sub_context = %CONTEXT
MetaAgent.actor_net = @meta/ddpg_actor_net
MetaAgent.critic_net = @meta/ddpg_critic_net
MetaAgent.dqda_clipping = 0.0
MetaAgent.td_errors_loss = @tf.losses.huber_loss
MetaAgent.target_q_clipping = %TARGET_Q_CLIPPING
# Create state preprocess
STATE_PREPROCESS_CLASS = @StatePreprocess
StatePreprocess.ndims = %SUBGOAL_DIM
state_preprocess_net.states_hidden_layers = (100, 100)
state_preprocess_net.num_output_dims = %SUBGOAL_DIM
state_preprocess_net.images = %IMAGES
action_embed_net.num_output_dims = %SUBGOAL_DIM
INVERSE_DYNAMICS_CLASS = @InverseDynamics
# actor_net
ACTOR_HIDDEN_SIZE_1 = 300
ACTOR_HIDDEN_SIZE_2 = 300
agent/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
agent/ddpg_actor_net.activation_fn = @tf.nn.relu
agent/ddpg_actor_net.zero_obs = %ZERO_OBS
agent/ddpg_actor_net.images = %IMAGES
meta/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
meta/ddpg_actor_net.activation_fn = @tf.nn.relu
meta/ddpg_actor_net.zero_obs = False
meta/ddpg_actor_net.images = %IMAGES
# critic_net
CRITIC_HIDDEN_SIZE_1 = 300
CRITIC_HIDDEN_SIZE_2 = 300
agent/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
agent/ddpg_critic_net.actions_hidden_layers = None
agent/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
agent/ddpg_critic_net.weight_decay = 0.0
agent/ddpg_critic_net.activation_fn = @tf.nn.relu
agent/ddpg_critic_net.zero_obs = %ZERO_OBS
agent/ddpg_critic_net.images = %IMAGES
meta/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
meta/ddpg_critic_net.actions_hidden_layers = None
meta/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
meta/ddpg_critic_net.weight_decay = 0.0
meta/ddpg_critic_net.activation_fn = @tf.nn.relu
meta/ddpg_critic_net.zero_obs = False
meta/ddpg_critic_net.images = %IMAGES
tf.losses.huber_loss.delta = 1.0
# Sample action
uvf_add_noise_fn.stddev = 1.0
meta_add_noise_fn.stddev = %META_EXPLORE_NOISE
# Update targets
ddpg_update_targets.tau = 0.001
td3_update_targets.tau = 0.005
research/efficient-hrl/configs/eval_uvf.gin (new file, mode 100644)
#-*-Python-*-
# Config eval
evaluate.environment = @create_maze_env()
evaluate.agent_class = %AGENT_CLASS
evaluate.meta_agent_class = %META_CLASS
evaluate.state_preprocess_class = %STATE_PREPROCESS_CLASS
evaluate.num_episodes_eval = 50
evaluate.num_episodes_videos = 1
evaluate.gamma = 1.0
evaluate.eval_interval_secs = 1
evaluate.generate_videos = False
evaluate.generate_summaries = True
evaluate.eval_modes = %EVAL_MODES
evaluate.max_steps_per_episode = %RESET_EPISODE_PERIOD
research/efficient-hrl/configs/train_uvf.gin (new file, mode 100644)
#-*-Python-*-
# Create replay_buffer
agent/CircularBuffer.buffer_size = 200000
meta/CircularBuffer.buffer_size = 200000
agent/CircularBuffer.scope = "agent"
meta/CircularBuffer.scope = "meta"
# Config train
train_uvf.environment = @create_maze_env()
train_uvf.agent_class = %AGENT_CLASS
train_uvf.meta_agent_class = %META_CLASS
train_uvf.state_preprocess_class = %STATE_PREPROCESS_CLASS
train_uvf.inverse_dynamics_class = %INVERSE_DYNAMICS_CLASS
train_uvf.replay_buffer = @agent/CircularBuffer()
train_uvf.meta_replay_buffer = @meta/CircularBuffer()
train_uvf.critic_optimizer = @critic/AdamOptimizer()
train_uvf.actor_optimizer = @actor/AdamOptimizer()
train_uvf.meta_critic_optimizer = @meta_critic/AdamOptimizer()
train_uvf.meta_actor_optimizer = @meta_actor/AdamOptimizer()
train_uvf.repr_optimizer = @repr/AdamOptimizer()
train_uvf.num_episodes_train = 25000
train_uvf.batch_size = 100
train_uvf.initial_episodes = 5
train_uvf.gamma = 0.99
train_uvf.meta_gamma = 0.99
train_uvf.reward_scale_factor = 1.0
train_uvf.target_update_period = 2
train_uvf.num_updates_per_observation = 1
train_uvf.num_collect_per_update = 1
train_uvf.num_collect_per_meta_update = 10
train_uvf.debug_summaries = False
train_uvf.log_every_n_steps = 1000
train_uvf.save_policy_every_n_steps = 100000
# Config Optimizers
critic/AdamOptimizer.learning_rate = 0.001
critic/AdamOptimizer.beta1 = 0.9
critic/AdamOptimizer.beta2 = 0.999
actor/AdamOptimizer.learning_rate = 0.0001
actor/AdamOptimizer.beta1 = 0.9
actor/AdamOptimizer.beta2 = 0.999
meta_critic/AdamOptimizer.learning_rate = 0.001
meta_critic/AdamOptimizer.beta1 = 0.9
meta_critic/AdamOptimizer.beta2 = 0.999
meta_actor/AdamOptimizer.learning_rate = 0.0001
meta_actor/AdamOptimizer.beta1 = 0.9
meta_actor/AdamOptimizer.beta2 = 0.999
repr/AdamOptimizer.learning_rate = 0.0001
repr/AdamOptimizer.beta1 = 0.9
repr/AdamOptimizer.beta2 = 0.999
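These config files are meant to be composed; the actual wiring lives in scripts/local_train.py and scripts/local_eval.py, which are not among the files shown on this page. A hedged sketch of how the composition could look with plain gin-config — the file list and macro values below are illustrative guesses, and the repo's configurables (UvfAgent, MetaAgent, create_maze_env, the samplers, etc.) must already be imported so gin can resolve the references:

```
import gin.tf

# Hypothetical composition: shared agent/training settings plus one
# environment-specific context config, with the remaining macros bound here.
gin.parse_config_files_and_bindings(
    config_files=[
        'configs/base_uvf.gin',
        'configs/train_uvf.gin',
        'context/configs/ant_maze.gin',
    ],
    bindings=[
        'SUBGOAL_DIM = 15',          # illustrative values only
        'CONTEXT_RANGE_MIN = -10',
        'CONTEXT_RANGE_MAX = 10',
        'IMAGES = False',
        'ZERO_OBS = False',
        'TARGET_Q_CLIPPING = None',
        'META_EXPLORE_NOISE = 1.0',
    ])
```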
research/efficient-hrl/context/__init__.py (new file, mode 100644)
research/efficient-hrl/context/configs/ant_block.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntBlock"
ZERO_OBS = False
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [3, 4]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_block_maze.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntBlockMaze"
ZERO_OBS = False
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (12, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [3, 4]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [8, 0]
eval2/ConstantSampler.value = [8, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_fall_multi.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntFall"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 4.5]
research/efficient-hrl/context/configs/ant_fall_multi_img.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntFall"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 3
task/relative_context_multi_transition_fn.k = 3
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 0]
research/efficient-hrl/context/configs/ant_fall_single.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntFall"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@eval1/ConstantSampler],
"explore": [@eval1/ConstantSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 4.5]
research/efficient-hrl/context/configs/ant_maze.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntMaze"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_maze_img.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntMaze"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 2
task/relative_context_multi_transition_fn.k = 2
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_push_multi.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntPush"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-16, -4), (16, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval2"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval2": [@eval2/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval2/ConstantSampler.value = [0, 19]
research/efficient-hrl/context/configs/ant_push_multi_img.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntPush"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-16, -4), (16, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval2"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval2": [@eval2/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 2
task/relative_context_multi_transition_fn.k = 2
MetaAgent.k = %SUBGOAL_DIM
eval2/ConstantSampler.value = [0, 19]