Merge pull request #5870 from ofirnachum/master

Add training and eval code for efficient-hrl

c9f03bf6 · Neal Wu · GitHub · 2c181308 · 052361de · c9f03bf6
Unverified Commit c9f03bf6 authored Dec 06, 2018 by Neal Wu Committed by GitHub Dec 06, 2018
20 changed files
--- a/research/efficient-hrl/README.md
+++ b/research/efficient-hrl/README.md
-Code for performing Hierarchical RL based on
+Code for performing Hierarchical RL based on the following publications:
+
 "Data-Efficient Hierarchical Reinforcement Learning" by
 Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
 (https://arxiv.org/abs/1805.08296).

-
-This library currently includes three of the environments used:
-Ant Maze, Ant Push, and Ant Fall.
-
-The training code is planned to be open-sourced at a later time.
+"Near-Optimal Representation Learning for Hierarchical Reinforcement Learning"
+by Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
+(https://arxiv.org/abs/1810.01257).


 Requirements:
 * TensorFlow (see http://www.tensorflow.org for how to install/upgrade)
+* Gin Config (see https://github.com/google/gin-config)
+* TensorFlow Agents (see https://github.com/tensorflow/agents)
 * OpenAI Gym (see http://gym.openai.com/docs, be sure to install MuJoCo as well)
 * NumPy (see http://www.numpy.org/)


 Quick Start:

-Run a random policy on AntMaze (or AntPush, AntFall):
+Run a training job based on the original HIRO paper on Ant Maze:
+
+```
+python scripts/local_train.py test1 hiro_orig ant_maze base_uvf suite
+```
+
+Run a continuous evaluation job for that experiment:

 ```
-python environments/__init__.py --env=AntMaze
+python scripts/local_eval.py test1 hiro_orig ant_maze base_uvf suite
 ```

+To run the same experiment with online representation learning (the
+"Near-Optimal" paper), change `hiro_orig` to `hiro_repr`.
+To run HIRO on only the xy coordinates of the agent, use `hiro_xy` instead.
+
+To run on other environments, change `ant_maze` to something else; e.g.,
+`ant_push_multi`, `ant_fall_multi`, etc.  See `context/configs/*` for other options.
+
+
+Basic Code Guide:
+
+The code for training resides in `train.py`.  It trains a lower-level policy
+(a UVF agent in the code) and a higher-level policy (a MetaAgent in the code)
+concurrently.  The higher-level policy communicates goals to the lower-level
+policy.  In the code, such a goal is called a context.  Not only does the lower-level
+policy act with respect to a context (a higher-level specified goal), but the
+higher-level policy also acts with respect to an environment-specified context
+(corresponding to the navigation target location associated with the task).
+Therefore, in `context/configs/*` you will find both specifications for task setup
+as well as goal configurations.  Most remaining hyperparameters used for
+training/evaluation may be found in `configs/*`.
+
+NOTE: Not all the code corresponding to the "Near-Optimal" paper is included.
+Namely, changes to low-level policy training proposed in the paper (discounting
+and auxiliary rewards) are not implemented here.  Performance should not change
+significantly.
+

 Maintained by Ofir Nachum (ofirnachum).
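
The two-level setup described in the Basic Code Guide can be sketched in plain Python. This is not code from the repository: the names `low_level_reward`, `goal_transition`, `rollout`, and the interval `c` are illustrative assumptions. The reward and goal-transition forms follow the HIRO paper, where a goal is a desired relative change in state.

```python
import math


def low_level_reward(state, goal, next_state):
    """Negative Euclidean distance between the target (state + goal) and next_state."""
    return -math.sqrt(sum((s + g - n) ** 2
                          for s, g, n in zip(state, goal, next_state)))


def goal_transition(state, goal, next_state):
    """Re-express a relative goal so it keeps pointing at the same absolute target."""
    return [s + g - n for s, g, n in zip(state, goal, next_state)]


def rollout(states, propose_goal, c=2):
    """Walk a fixed state sequence, asking for a fresh goal every c steps.

    Returns the (goal, reward) pairs seen by the lower-level policy.
    """
    history = []
    goal = None
    for t in range(len(states) - 1):
        state, next_state = states[t], states[t + 1]
        if t % c == 0:
            goal = propose_goal(state)  # the higher-level policy acts here
        history.append((goal, low_level_reward(state, goal, next_state)))
        goal = goal_transition(state, goal, next_state)
    return history


# Toy 2-D trajectory; the stand-in "higher-level policy" always asks for +2 in x.
trajectory = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0]]
history = rollout(trajectory, propose_goal=lambda s: [2.0, 0.0], c=2)
```

In this toy run the goal shrinks from `[2.0, 0.0]` to `[1.0, 0.0]` after the first step because the agent has already covered half the requested displacement. In the repository, this bookkeeping is handled by the context classes configured under `context/configs/*`.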
--- a/research/efficient-hrl/agent.py
+++ b/research/efficient-hrl/agent.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""A UVF agent.
+"""
+
+import tensorflow as tf
+import gin.tf
+from agents import ddpg_agent
+# pylint: disable=unused-import
+import cond_fn
+from utils import utils as uvf_utils
+from context import gin_imports
+# pylint: enable=unused-import
+slim = tf.contrib.slim
+
+
+@gin.configurable
+class UvfAgentCore(object):
+  """Defines basic functions for UVF agent. Must be inherited with an RL agent.
+
+  Used as lower-level agent.
+  """
+
+  def __init__(self,
+               observation_spec,
+               action_spec,
+               tf_env,
+               tf_context,
+               step_cond_fn=cond_fn.env_transition,
+               reset_episode_cond_fn=cond_fn.env_restart,
+               reset_env_cond_fn=cond_fn.false_fn,
+               metrics=None,
+               **base_agent_kwargs):
+    """Constructs a UVF agent.
+
+    Args:
+      observation_spec: A TensorSpec defining the observations.
+      action_spec: A BoundedTensorSpec defining the actions.
+      tf_env: A Tensorflow environment object.
+      tf_context: A Context class.
+      step_cond_fn: A function indicating whether to increment the num of steps.
+      reset_episode_cond_fn: A function indicating whether to restart the
+        episode, resampling the context.
+      reset_env_cond_fn: A function indicating whether to perform a manual reset
+        of the environment.
+      metrics: A list of functions that evaluate metrics of the agent.
+      **base_agent_kwargs: A dictionary of parameters for base RL Agent.
+    Raises:
+      ValueError: If 'dqda_clipping' is < 0.
+    """
+    self._step_cond_fn = step_cond_fn
+    self._reset_episode_cond_fn = reset_episode_cond_fn
+    self._reset_env_cond_fn = reset_env_cond_fn
+    self.metrics = metrics
+
+    # expose tf_context methods
+    self.tf_context = tf_context(tf_env=tf_env)
+    self.set_replay = self.tf_context.set_replay
+    self.sample_contexts = self.tf_context.sample_contexts
+    self.compute_rewards = self.tf_context.compute_rewards
+    self.gamma_index = self.tf_context.gamma_index
+    self.context_specs = self.tf_context.context_specs
+    self.context_as_action_specs = self.tf_context.context_as_action_specs
+    self.init_context_vars = self.tf_context.create_vars
+
+    self.env_observation_spec = observation_spec[0]
+    merged_observation_spec = (uvf_utils.merge_specs(
+        (self.env_observation_spec,) + self.context_specs),)
+    self._context_vars = dict()
+    self._action_vars = dict()
+
+    self.BASE_AGENT_CLASS.__init__(
+        self,
+        observation_spec=merged_observation_spec,
+        action_spec=action_spec,
+        **base_agent_kwargs
+    )
+
+  def set_meta_agent(self, agent=None):
+    self._meta_agent = agent
+
+  @property
+  def meta_agent(self):
+    return self._meta_agent
+
+  def actor_loss(self, states, actions, rewards, discounts,
+                 next_states):
+    """Returns the actor loss computed by the base agent.
+
+    Only `states` is used; the remaining arguments are ignored.
+
+    Args:
+      states: A batch of states, as expected by `BASE_AGENT_CLASS.actor_loss`.
+      actions: Unused.
+      rewards: Unused.
+      discounts: Unused.
+      next_states: Unused.
+    Returns:
+      The actor loss returned by `BASE_AGENT_CLASS.actor_loss`.
+    """
+    return self.BASE_AGENT_CLASS.actor_loss(self, states)
+
+  def action(self, state, context=None):
+    """Returns the next action for the state.
+
+    Args:
+      state: A [num_state_dims] tensor representing a state.
+      context: A list of [num_context_dims] tensor representing a context.
+    Returns:
+      A [num_action_dims] tensor representing the action.
+    """
+    merged_state = self.merged_state(state, context)
+    return self.BASE_AGENT_CLASS.action(self, merged_state)
+
+  def actions(self, state, context=None):
+    """Returns the next actions for a batch of states.
+
+    Args:
+      state: A [-1, num_state_dims] tensor representing a batch of states.
+      context: A list of [-1, num_context_dims] tensors representing a batch
+        of contexts.
+    Returns:
+      A [-1, num_action_dims] tensor representing the batch of actions.
+    """
+    merged_states = self.merged_states(state, context)
+    return self.BASE_AGENT_CLASS.actor_net(self, merged_states)
+
+  def log_probs(self, states, actions, state_reprs, contexts=None):
+    assert contexts is not None
+    batch_dims = [tf.shape(states)[0], tf.shape(states)[1]]
+    contexts = self.tf_context.context_multi_transition_fn(
+        contexts, states=tf.to_float(state_reprs))
+
+    flat_states = tf.reshape(states,
+                             [batch_dims[0] * batch_dims[1], states.shape[-1]])
+    flat_contexts = [tf.reshape(tf.cast(context, states.dtype),
+                                [batch_dims[0] * batch_dims[1], context.shape[-1]])
+                     for context in contexts]
+    flat_pred_actions = self.actions(flat_states, flat_contexts)
+    pred_actions = tf.reshape(flat_pred_actions,
+                              batch_dims + [flat_pred_actions.shape[-1]])
+
+    error = tf.square(actions - pred_actions)
+    spec_range = (self._action_spec.maximum - self._action_spec.minimum) / 2
+    normalized_error = error / tf.constant(spec_range) ** 2
+    return -normalized_error
+
+  @gin.configurable('uvf_add_noise_fn')
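+  def add_noise_fn(self, action_fn, stddev=1.0, debug=False,
+                   clip=True, global_step=None):
+    """Returns the action_fn with additive Gaussian noise.
+
+    Args:
+      action_fn: A callable(`state`, `context`) which returns a
+        [num_action_dims] tensor representing an action.
+      stddev: Standard deviation of the Gaussian noise.
+      debug: Print debug messages.
+      clip: Whether to clip the noisy action to the action spec.
+      global_step: An optional global step tensor. If given, stddev decays
+        exponentially during training, down to half its initial value.
+    Returns:
+      A callable(`state`, `context`) which returns a noisy [num_action_dims]
+        action tensor.
+    """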
+    if global_step is not None:
+      stddev *= tf.maximum(  # Decay exploration during training.
+          tf.train.exponential_decay(1.0, global_step, 1e6, 0.8), 0.5)
+    def noisy_action_fn(state, context=None):
+      """Noisy action fn."""
+      action = action_fn(state, context)
+      if debug:
+        action = uvf_utils.tf_print(
+            action, [action],
+            message='[add_noise_fn] pre-noise action',
+            first_n=100)
+      noise_dist = tf.distributions.Normal(tf.zeros_like(action),
+                                           tf.ones_like(action) * stddev)
+      noise = noise_dist.sample()
+      action += noise
+      if debug:
+        action = uvf_utils.tf_print(
+            action, [action],
+            message='[add_noise_fn] post-noise action',
+            first_n=100)
+      if clip:
+        action = uvf_utils.clip_to_spec(action, self._action_spec)
+      return action
+    return noisy_action_fn
+
+  def merged_state(self, state, context=None):
+    """Returns the merged state from the environment state and contexts.
+
+    Args:
+      state: A [num_state_dims] tensor representing a state.
+      context: A list of [num_context_dims] tensor representing a context.
+        If None, use the internal context.
+    Returns:
+      A [num_merged_state_dims] tensor representing the merged state.
+    """
+    if context is None:
+      context = list(self.context_vars)
+    state = tf.concat([state,] + context, axis=-1)
+    self._validate_states(self._batch_state(state))
+    return state
+
+  def merged_states(self, states, contexts=None):
+    """Returns the batch merged state from the batch env state and contexts.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      contexts: A list of [batch_size, num_context_dims] tensor
+        representing a batch of contexts. If None,
+        use the internal context.
+    Returns:
+      A [batch_size, num_merged_state_dims] tensor representing the batch
+        of merged states.
+    """
+    if contexts is None:
+      contexts = [tf.tile(tf.expand_dims(context, axis=0),
+                          (tf.shape(states)[0], 1)) for
+                  context in self.context_vars]
+    states = tf.concat([states,] + contexts, axis=-1)
+    self._validate_states(states)
+    return states
+
+  def unmerged_states(self, merged_states):
+    """Returns the batch state and contexts from the batch merged state.
+
+    Args:
+      merged_states: A [batch_size, num_merged_state_dims] tensor
+        representing a batch of merged states.
+    Returns:
+      A [batch_size, num_state_dims] tensor and a list of
+        [batch_size, num_context_dims] tensors representing the batch state
+        and contexts respectively.
+    """
+    self._validate_states(merged_states)
+    num_state_dims = self.env_observation_spec.shape.as_list()[0]
+    num_context_dims_list = [c.shape.as_list()[0] for c in self.context_specs]
+    states = merged_states[:, :num_state_dims]
+    contexts = []
+    i = num_state_dims
+    for num_context_dims in num_context_dims_list:
+      contexts.append(merged_states[:, i: i+num_context_dims])
+      i += num_context_dims
+    return states, contexts
+
+  def sample_random_actions(self, batch_size=1):
+    """Return random actions.
+
+    Args:
+      batch_size: Batch size.
+    Returns:
+      A [batch_size, num_action_dims] tensor representing the batch of actions.
+    """
+    actions = tf.concat(
+        [
+            tf.random_uniform(
+                shape=(batch_size, 1),
+                minval=self._action_spec.minimum[i],
+                maxval=self._action_spec.maximum[i])
+            for i in range(self._action_spec.shape[0].value)
+        ],
+        axis=1)
+    return actions
+
+  def clip_actions(self, actions):
+    """Clip actions to spec.
+
+    Args:
+      actions: A [batch_size, num_action_dims] tensor representing
+      the batch of actions.
+    Returns:
+      A [batch_size, num_action_dims] tensor representing the batch
+      of clipped actions.
+    """
+    actions = tf.concat(
+        [
+            tf.clip_by_value(
+                actions[:, i:i+1],
+                self._action_spec.minimum[i],
+                self._action_spec.maximum[i])
+            for i in range(self._action_spec.shape[0].value)
+        ],
+        axis=1)
+    return actions
+
+  def mix_contexts(self, contexts, insert_contexts, indices):
+    """Mix two contexts based on indices.
+
+    Args:
+      contexts: A list of [batch_size, num_context_dims] tensor representing
+      the batch of contexts.
+      insert_contexts: A list of [batch_size, num_context_dims] tensor
+      representing the batch of contexts to be inserted.
+      indices: A list of a list of integers denoting indices to replace.
+    Returns:
+      A list of resulting contexts.
+    """
+    if indices is None: indices = [[]] * len(contexts)
+    assert len(contexts) == len(indices)
+    assert all([spec.shape.ndims == 1 for spec in self.context_specs])
+    mix_contexts = []
+    for contexts_, insert_contexts_, indices_, spec in zip(
+        contexts, insert_contexts, indices, self.context_specs):
+      mix_contexts.append(
+          tf.concat(
+              [
+                  insert_contexts_[:, i:i + 1] if i in indices_ else
+                  contexts_[:, i:i + 1] for i in range(spec.shape.as_list()[0])
+              ],
+              axis=1))
+    return mix_contexts
+
+  def begin_episode_ops(self, mode, action_fn=None, state=None):
+    """Returns ops that reset agent at beginning of episodes.
+
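+    Args:
+      mode: a string representing the mode=[train, explore, eval].
+      action_fn: An action function passed through to the context's reset.
+      state: A state tensor passed through to the context's reset.
+    Returns:
+      A list of ops.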
+    """
+    all_ops = []
+    for _, action_var in sorted(self._action_vars.items()):
+      sample_action = self.sample_random_actions(1)[0]
+      all_ops.append(tf.assign(action_var, sample_action))
+    all_ops += self.tf_context.reset(mode=mode, agent=self._meta_agent,
+                                     action_fn=action_fn, state=state)
+    return all_ops
+
+  def cond_begin_episode_op(self, cond, input_vars, mode, meta_action_fn):
+    """Returns op that resets agent at beginning of episodes.
+
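+    A new episode is begun if the cond op evaluates to `False`.
+
+    Args:
+      cond: a Boolean tensor variable.
+      input_vars: A list of tensor variables: (state, action, reward,
+        next_state, state_repr, next_state_repr).
+      mode: a string representing the mode=[train, explore, eval].
+      meta_action_fn: An action function passed through to the context's step
+        and reset ops.
+    Returns:
+      Conditional begin op.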
+    """
+    (state, action, reward, next_state,
+     state_repr, next_state_repr) = input_vars
+    def continue_fn():
+      """Continue op fn."""
+      items = [state, action, reward, next_state,
+               state_repr, next_state_repr] + list(self.context_vars)
+      batch_items = [tf.expand_dims(item, 0) for item in items]
+      (states, actions, rewards, next_states,
+       state_reprs, next_state_reprs) = batch_items[:6]
+      context_reward = self.compute_rewards(
+          mode, state_reprs, actions, rewards, next_state_reprs,
+          batch_items[6:])[0][0]
+      context_reward = tf.cast(context_reward, dtype=reward.dtype)
+      if self.meta_agent is not None:
+        meta_action = tf.concat(self.context_vars, -1)
+        items = [state, meta_action, reward, next_state,
+                 state_repr, next_state_repr] + list(self.meta_agent.context_vars)
+        batch_items = [tf.expand_dims(item, 0) for item in items]
+        (states, meta_actions, rewards, next_states,
+         state_reprs, next_state_reprs) = batch_items[:6]
+        meta_reward = self.meta_agent.compute_rewards(
+            mode, states, meta_actions, rewards,
+            next_states, batch_items[6:])[0][0]
+        meta_reward = tf.cast(meta_reward, dtype=reward.dtype)
+      else:
+        meta_reward = tf.constant(0, dtype=reward.dtype)
+
+      with tf.control_dependencies([context_reward, meta_reward]):
+        step_ops = self.tf_context.step(mode=mode, agent=self._meta_agent,
+                                        state=state,
+                                        next_state=next_state,
+                                        state_repr=state_repr,
+                                        next_state_repr=next_state_repr,
+                                        action_fn=meta_action_fn)
+      with tf.control_dependencies(step_ops):
+        context_reward, meta_reward = map(tf.identity, [context_reward, meta_reward])
+      return context_reward, meta_reward
+    def begin_episode_fn():
+      """Begin op fn."""
+      begin_ops = self.begin_episode_ops(mode=mode, action_fn=meta_action_fn, state=state)
+      with tf.control_dependencies(begin_ops):
+        return tf.zeros_like(reward), tf.zeros_like(reward)
+    with tf.control_dependencies(input_vars):
+      cond_begin_episode_op = tf.cond(cond, continue_fn, begin_episode_fn)
+    return cond_begin_episode_op
+
+  def get_env_base_wrapper(self, env_base, **begin_kwargs):
+    """Create a wrapper around env_base, with agent-specific begin/end_episode.
+
+    Args:
+      env_base: A python environment base.
+      **begin_kwargs: Keyword args for begin_episode_ops.
+    Returns:
+      An object with begin_episode() and end_episode().
+    """
+    begin_ops = self.begin_episode_ops(**begin_kwargs)
+    return uvf_utils.get_contextual_env_base(env_base, begin_ops)
+
+  def init_action_vars(self, name, i=None):
+    """Create and return a tensorflow Variable holding an action.
+
+    Args:
+      name: Name of the variables.
+      i: Integer id.
+    Returns:
+      A [num_action_dims] tensor.
+    """
+    if i is not None:
+      name += '_%d' % i
+    assert name not in self._action_vars, ('Conflict! %s is already '
+                                           'initialized.') % name
+    self._action_vars[name] = tf.Variable(
+        self.sample_random_actions(1)[0], name='%s_action' % (name))
+    self._validate_actions(tf.expand_dims(self._action_vars[name], 0))
+    return self._action_vars[name]
+
+  @gin.configurable('uvf_critic_function')
+  def critic_function(self, critic_vals, states, critic_fn=None):
+    """Computes q values based on outputs from the critic net.
+
+    Args:
+      critic_vals: A tf.float32 [batch_size, ...] tensor representing outputs
+        from the critic net.
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      critic_fn: A callable that process outputs from critic_net and
+        outputs a [batch_size] tensor representing q values.
+    Returns:
+      A tf.float32 [batch_size] tensor representing q values.
+    """
+    if critic_fn is not None:
+      env_states, contexts = self.unmerged_states(states)
+      critic_vals = critic_fn(critic_vals, env_states, contexts)
+    critic_vals.shape.assert_has_rank(1)
+    return critic_vals
+
+  def get_action_vars(self, key):
+    return self._action_vars[key]
+
+  def get_context_vars(self, key):
+    return self.tf_context.context_vars[key]
+
+  def step_cond_fn(self, *args):
+    return self._step_cond_fn(self, *args)
+
+  def reset_episode_cond_fn(self, *args):
+    return self._reset_episode_cond_fn(self, *args)
+
+  def reset_env_cond_fn(self, *args):
+    return self._reset_env_cond_fn(self, *args)
+
+  @property
+  def context_vars(self):
+    return self.tf_context.vars
+
+
+@gin.configurable
+class MetaAgentCore(UvfAgentCore):
+  """Defines basic functions for UVF Meta-agent. Must be inherited with an RL agent.
+
+  Used as higher-level agent.
+  """
+
+  def __init__(self,
+               observation_spec,
+               action_spec,
+               tf_env,
+               tf_context,
+               sub_context,
+               step_cond_fn=cond_fn.env_transition,
+               reset_episode_cond_fn=cond_fn.env_restart,
+               reset_env_cond_fn=cond_fn.false_fn,
+               metrics=None,
+               actions_reg=0.,
+               k=2,
+               **base_agent_kwargs):
+    """Constructs a Meta agent.
+
+    Args:
+      observation_spec: A TensorSpec defining the observations.
+      action_spec: A BoundedTensorSpec defining the actions.
+      tf_env: A Tensorflow environment object.
+      tf_context: A Context class.
+      sub_context: A Context class for the lower-level agent. Its
+        context_as_action_specs are used as this agent's action spec.
+      step_cond_fn: A function indicating whether to increment the num of steps.
+      reset_episode_cond_fn: A function indicating whether to restart the
+        episode, resampling the context.
+      reset_env_cond_fn: A function indicating whether to perform a manual reset
+        of the environment.
+      metrics: A list of functions that evaluate metrics of the agent.
+      actions_reg: Coefficient of the L1 regularizer applied to the actions in
+        actor_loss.
+      k: Index of the first action dimension that is regularized.
+      **base_agent_kwargs: A dictionary of parameters for base RL Agent.
+    Raises:
+      ValueError: If 'dqda_clipping' is < 0.
+    """
+    self._step_cond_fn = step_cond_fn
+    self._reset_episode_cond_fn = reset_episode_cond_fn
+    self._reset_env_cond_fn = reset_env_cond_fn
+    self.metrics = metrics
+    self._actions_reg = actions_reg
+    self._k = k
+
+    # expose tf_context methods
+    self.tf_context = tf_context(tf_env=tf_env)
+    self.sub_context = sub_context(tf_env=tf_env)
+    self.set_replay = self.tf_context.set_replay
+    self.sample_contexts = self.tf_context.sample_contexts
+    self.compute_rewards = self.tf_context.compute_rewards
+    self.gamma_index = self.tf_context.gamma_index
+    self.context_specs = self.tf_context.context_specs
+    self.context_as_action_specs = self.tf_context.context_as_action_specs
+    self.sub_context_as_action_specs = self.sub_context.context_as_action_specs
+    self.init_context_vars = self.tf_context.create_vars
+
+    self.env_observation_spec = observation_spec[0]
+    merged_observation_spec = (uvf_utils.merge_specs(
+        (self.env_observation_spec,) + self.context_specs),)
+    self._context_vars = dict()
+    self._action_vars = dict()
+
+    assert len(self.context_as_action_specs) == 1
+    self.BASE_AGENT_CLASS.__init__(
+        self,
+        observation_spec=merged_observation_spec,
+        action_spec=self.sub_context_as_action_specs,
+        **base_agent_kwargs
+    )
+
+  @gin.configurable('meta_add_noise_fn')
+  def add_noise_fn(self, action_fn, stddev=1.0, debug=False,
+                   global_step=None):
+    noisy_action_fn = super(MetaAgentCore, self).add_noise_fn(
+        action_fn, stddev,
+        clip=True, global_step=global_step)
+    return noisy_action_fn
+
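+  def actor_loss(self, states, actions, rewards, discounts,
+                 next_states):
+    """Returns the base agent's actor loss plus an L1 penalty on the actions.
+
+    The penalty is the mean over the batch of the summed absolute values of
+    action dimensions from index `k` onward, scaled by `actions_reg`.
+
+    Args:
+      states: A batch of states, as expected by `BASE_AGENT_CLASS.actor_loss`.
+      actions: Unused; the actions are recomputed from the actor net.
+      rewards: Unused.
+      discounts: Unused.
+      next_states: Unused.
+    Returns:
+      The regularized actor loss.
+    """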
+    actions = self.actor_net(states, stop_gradients=False)
+    regularizer = self._actions_reg * tf.reduce_mean(
+        tf.reduce_sum(tf.abs(actions[:, self._k:]), -1), 0)
+    loss = self.BASE_AGENT_CLASS.actor_loss(self, states)
+    return regularizer + loss
+
+
+@gin.configurable
+class UvfAgent(UvfAgentCore, ddpg_agent.TD3Agent):
+  """A TD3 agent with UVF.
+  """
+  BASE_AGENT_CLASS = ddpg_agent.TD3Agent
+  ACTION_TYPE = 'continuous'
+
+  def __init__(self, *args, **kwargs):
+    UvfAgentCore.__init__(self, *args, **kwargs)
+
+
+@gin.configurable
+class MetaAgent(MetaAgentCore, ddpg_agent.TD3Agent):
+  """A TD3 meta-agent.
+  """
+  BASE_AGENT_CLASS = ddpg_agent.TD3Agent
+  ACTION_TYPE = 'continuous'
+
+  def __init__(self, *args, **kwargs):
+    MetaAgentCore.__init__(self, *args, **kwargs)
+
+
+@gin.configurable()
+def state_preprocess_net(
+    states,
+    num_output_dims=2,
+    states_hidden_layers=(100,),
+    normalizer_fn=None,
+    activation_fn=tf.nn.relu,
+    zero_time=True,
+    images=False):
+  """Creates a simple feed forward net for embedding states.
+  """
+  with slim.arg_scope(
+      [slim.fully_connected],
+      activation_fn=activation_fn,
+      normalizer_fn=normalizer_fn,
+      weights_initializer=slim.variance_scaling_initializer(
+          factor=1.0/3.0, mode='FAN_IN', uniform=True)):
+
+    states_shape = tf.shape(states)
+    states_dtype = states.dtype
+    states = tf.to_float(states)
+    if images:  # Zero-out x-y
+      states *= tf.constant([0.] * 2 + [1.] * (states.shape[-1] - 2), dtype=states.dtype)
+    if zero_time:
+      states *= tf.constant([1.] * (states.shape[-1] - 1) + [0.], dtype=states.dtype)
+    orig_states = states
+    embed = states
+    if states_hidden_layers:
+      embed = slim.stack(embed, slim.fully_connected, states_hidden_layers,
+                         scope='states')
+
+    with slim.arg_scope([slim.fully_connected],
+                        weights_regularizer=None,
+                        weights_initializer=tf.random_uniform_initializer(
+                            minval=-0.003, maxval=0.003)):
+      embed = slim.fully_connected(embed, num_output_dims,
+                                   activation_fn=None,
+                                   normalizer_fn=None,
+                                   scope='value')
+
+    output = embed
+    output = tf.cast(output, states_dtype)
+    return output
+
+
+@gin.configurable()
+def action_embed_net(
+    actions,
+    states=None,
+    num_output_dims=2,
+    hidden_layers=(400, 300),
+    normalizer_fn=None,
+    activation_fn=tf.nn.relu,
+    zero_time=True,
+    images=False):
+  """Creates a simple feed forward net for embedding actions.
+  """
+  with slim.arg_scope(
+      [slim.fully_connected],
+      activation_fn=activation_fn,
+      normalizer_fn=normalizer_fn,
+      weights_initializer=slim.variance_scaling_initializer(
+          factor=1.0/3.0, mode='FAN_IN', uniform=True)):
+
+    actions = tf.to_float(actions)
+    if states is not None:
+      if images:  # Zero-out x-y
+        states *= tf.constant([0.] * 2 + [1.] * (states.shape[-1] - 2), dtype=states.dtype)
+      if zero_time:
+        states *= tf.constant([1.] * (states.shape[-1] - 1) + [0.], dtype=states.dtype)
+      actions = tf.concat([actions, tf.to_float(states)], -1)
+
+    embed = actions
+    if hidden_layers:
+      embed = slim.stack(embed, slim.fully_connected, hidden_layers,
+                         scope='hidden')
+
+    with slim.arg_scope([slim.fully_connected],
+                        weights_regularizer=None,
+                        weights_initializer=tf.random_uniform_initializer(
+                            minval=-0.003, maxval=0.003)):
+      embed = slim.fully_connected(embed, num_output_dims,
+                                   activation_fn=None,
+                                   normalizer_fn=None,
+                                   scope='value')
+      if num_output_dims == 1:
+        return embed[:, 0, ...]
+      else:
+        return embed
+
+
+def huber(x, kappa=0.1):
+  """Returns the elementwise Huber loss of x with threshold kappa,
+  divided by kappa.
+  """
+  return (0.5 * tf.square(x) * tf.to_float(tf.abs(x) <= kappa) +
+          kappa * (tf.abs(x) - 0.5 * kappa) * tf.to_float(tf.abs(x) > kappa)
+          ) / kappa
+
+
+@gin.configurable()
+class StatePreprocess(object):
+  STATE_PREPROCESS_NET_SCOPE = 'state_process_net'
+  ACTION_EMBED_NET_SCOPE = 'action_embed_net'
+
+  def __init__(self, trainable=False,
+               state_preprocess_net=lambda states: states,
+               action_embed_net=lambda actions, *args, **kwargs: actions,
+               ndims=None):
+    self.trainable = trainable
+    self._scope = tf.get_variable_scope().name
+    self._ndims = ndims
+    self._state_preprocess_net = tf.make_template(
+        self.STATE_PREPROCESS_NET_SCOPE, state_preprocess_net,
+        create_scope_now_=True)
+    self._action_embed_net = tf.make_template(
+        self.ACTION_EMBED_NET_SCOPE, action_embed_net,
+        create_scope_now_=True)
+
+  def __call__(self, states):
+    batched = states.get_shape().ndims != 1
+    if not batched:
+      states = tf.expand_dims(states, 0)
+    embedded = self._state_preprocess_net(states)
+    if self._ndims is not None:
+      embedded = embedded[..., :self._ndims]
+    if not batched:
+      return embedded[0]
+    return embedded
+
+  def loss(self, states, next_states, low_actions, low_states):
+    batch_size = tf.shape(states)[0]
+    d = int(low_states.shape[1])
+    # Sample indices into meta-transition to train on.
+    probs = 0.99 ** tf.range(d, dtype=tf.float32)
+    probs *= tf.constant([1.0] * (d - 1) + [1.0 / (1 - 0.99)],
+                         dtype=tf.float32)
+    probs /= tf.reduce_sum(probs)
+    index_dist = tf.distributions.Categorical(probs=probs, dtype=tf.int64)
+    indices = index_dist.sample(batch_size)
+    batch_size = tf.cast(batch_size, tf.int64)
+    next_indices = tf.concat(
+        [tf.range(batch_size, dtype=tf.int64)[:, None],
+         (1 + indices[:, None]) % d], -1)
+    new_next_states = tf.where(indices < d - 1,
+                               tf.gather_nd(low_states, next_indices),
+                               next_states)
+    next_states = new_next_states
+
+    embed1 = tf.to_float(self._state_preprocess_net(states))
+    embed2 = tf.to_float(self._state_preprocess_net(next_states))
+    action_embed = self._action_embed_net(
+        tf.layers.flatten(low_actions), states=states)
+
+    tau = 2.0
+    fn = lambda z: tau * tf.reduce_sum(huber(z), -1)
+    all_embed = tf.get_variable('all_embed', [1024, int(embed1.shape[-1])],
+                                initializer=tf.zeros_initializer())
+    upd = all_embed.assign(tf.concat([all_embed[batch_size:], embed2], 0))
+    with tf.control_dependencies([upd]):
+      close = 1 * tf.reduce_mean(fn(embed1 + action_embed - embed2))
+      prior_log_probs = tf.reduce_logsumexp(
+          -fn((embed1 + action_embed)[:, None, :] - all_embed[None, :, :]),
+          axis=-1) - tf.log(tf.to_float(all_embed.shape[0]))
+      far = tf.reduce_mean(tf.exp(-fn((embed1 + action_embed)[1:] - embed2[:-1])
+                                  - tf.stop_gradient(prior_log_probs[1:])))
+      repr_log_probs = tf.stop_gradient(
+          -fn(embed1 + action_embed - embed2) - prior_log_probs) / tau
+    return close + far, repr_log_probs, indices
+
+  def get_trainable_vars(self):
+    return (
+        slim.get_trainable_variables(
+            uvf_utils.join_scope(self._scope, self.STATE_PREPROCESS_NET_SCOPE)) +
+        slim.get_trainable_variables(
+            uvf_utils.join_scope(self._scope, self.ACTION_EMBED_NET_SCOPE)))
+
+
+@gin.configurable()
+class InverseDynamics(object):
+  INVERSE_DYNAMICS_NET_SCOPE = 'inverse_dynamics'
+
+  def __init__(self, spec):
+    self._spec = spec
+
+  def sample(self, states, next_states, num_samples, orig_goals, sc=0.5):
+    goal_dim = orig_goals.shape[-1]
+    spec_range = (self._spec.maximum - self._spec.minimum) / 2 * tf.ones([goal_dim])
+    loc = tf.cast(next_states - states, tf.float32)[:, :goal_dim]
+    scale = sc * tf.tile(tf.reshape(spec_range, [1, goal_dim]),
+                         [tf.shape(states)[0], 1])
+    dist = tf.distributions.Normal(loc, scale)
+    if num_samples == 1:
+      return dist.sample()
+    samples = tf.concat([dist.sample(num_samples - 2),
+                         tf.expand_dims(loc, 0),
+                         tf.expand_dims(orig_goals, 0)], 0)
+    return uvf_utils.clip_to_spec(samples, self._spec)
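The candidate-goal sampling in `InverseDynamics.sample` above can be illustrated outside the TensorFlow graph. The following is a minimal NumPy sketch, not the code from this change: the function name `sample_candidate_goals`, the scalar `low`/`high` bounds standing in for `spec.minimum`/`spec.maximum`, and the use of `np.clip` in place of `uvf_utils.clip_to_spec` are all assumptions made for illustration.

```python
import numpy as np


def sample_candidate_goals(states, next_states, orig_goals, low, high,
                           num_samples, sc=0.5, rng=None):
  """NumPy sketch of candidate-goal sampling (all names are illustrative)."""
  rng = np.random.default_rng(0) if rng is None else rng
  goal_dim = orig_goals.shape[-1]
  # Half-width of the goal range, one value per goal dimension.
  spec_range = (high - low) / 2.0 * np.ones(goal_dim)
  # Center the candidates on the observed state change in goal space.
  loc = (next_states - states)[:, :goal_dim].astype(np.float32)
  scale = sc * spec_range
  if num_samples == 1:
    # Mirrors the original: a single unclipped Gaussian draw.
    return rng.normal(loc, scale)
  # num_samples - 2 random draws, plus the mean and the original goals.
  random_part = rng.normal(loc, scale, size=(num_samples - 2,) + loc.shape)
  samples = np.concatenate([random_part, loc[None], orig_goals[None]], axis=0)
  # Assumption: clip_to_spec clips element-wise to [minimum, maximum].
  return np.clip(samples, low, high)


states = np.zeros((4, 5))
next_states = np.ones((4, 5))
orig_goals = np.full((4, 2), 0.5)
candidates = sample_candidate_goals(
    states, next_states, orig_goals, low=-10.0, high=10.0, num_samples=8)
print(candidates.shape)
```

The last two candidates are always the observed state difference and the original goal, so the set of candidates never excludes the goal that was actually used.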
--- a/research/efficient-hrl/agents/__init__.py
+++ b/research/efficient-hrl/agents/__init__.py
+
--- a/research/efficient-hrl/agents/circular_buffer.py
+++ b/research/efficient-hrl/agents/circular_buffer.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""A circular buffer where each element is a list of tensors.
+
+Each element of the buffer is a list of tensors. An example use case is a replay
+buffer in reinforcement learning, where each element is a list of tensors
+representing the state, action, reward etc.
+
+New elements are added sequentially, and once the buffer is full, we
+start overwriting them in a circular fashion. Reading does not remove any
+elements; only adding new elements does, by overwriting the oldest ones.
+"""
+
+import collections
+import numpy as np
+import tensorflow as tf
+
+import gin.tf
+
+
+@gin.configurable
+class CircularBuffer(object):
+  """A circular buffer where each element is a list of tensors."""
+
+  def __init__(self, buffer_size=1000, scope='replay_buffer'):
+    """Circular buffer of lists of tensors.
+
+    Args:
+      buffer_size: (integer) maximum number of tensor lists the buffer can hold.
+      scope: (string) variable scope for creating the variables.
+    """
+    self._buffer_size = np.int64(buffer_size)
+    self._scope = scope
+    self._tensors = collections.OrderedDict()
+    with tf.variable_scope(self._scope):
+      self._num_adds = tf.Variable(0, dtype=tf.int64, name='num_adds')
+    self._num_adds_cs = tf.contrib.framework.CriticalSection(name='num_adds')
+
+  @property
+  def buffer_size(self):
+    return self._buffer_size
+
+  @property
+  def scope(self):
+    return self._scope
+
+  @property
+  def num_adds(self):
+    return self._num_adds
+
+  def _create_variables(self, tensors):
+    with tf.variable_scope(self._scope):
+      for name in tensors.keys():
+        tensor = tensors[name]
+        self._tensors[name] = tf.get_variable(
+            name='BufferVariable_' + name,
+            shape=[self._buffer_size] + tensor.get_shape().as_list(),
+            dtype=tensor.dtype,
+            trainable=False)
+
+  def _validate(self, tensors):
+    """Validate shapes of tensors."""
+    if len(tensors) != len(self._tensors):
+      raise ValueError('Expected tensors to have %d elements. Received %d '
+                       'instead.' % (len(self._tensors), len(tensors)))
+    if self._tensors.keys() != tensors.keys():
+      raise ValueError('The keys of tensors should always be the same. '
+                       'Received %s instead of %s.' %
+                       (tensors.keys(), self._tensors.keys()))
+    for name, tensor in tensors.items():
+      if tensor.get_shape().as_list() != self._tensors[
+          name].get_shape().as_list()[1:]:
+        raise ValueError('Tensor %s has incorrect shape.' % name)
+      if not tensor.dtype.is_compatible_with(self._tensors[name].dtype):
+        raise ValueError(
+            'Tensor %s has incorrect data type. Expected %s, received %s' %
+            (name, self._tensors[name].read_value().dtype, tensor.dtype))
+
+  def add(self, tensors):
+    """Adds an element (list/tuple/dict of tensors) to the buffer.
+
+    Args:
+      tensors: (list/tuple/dict of tensors) to be added to the buffer.
+    Returns:
+      An add operation that adds the input `tensors` to the buffer. Similar to
+        an enqueue_op.
+    Raises:
+      ValueError: If the shapes and data types of input `tensors` are not the
+        same across calls to the add function.
+    """
+    return self.maybe_add(tensors, True)
+
+  def maybe_add(self, tensors, condition):
+    """Adds an element (tensors) to the buffer based on the condition.
+
+    Args:
+      tensors: (list/tuple/dict of tensors) to be added to the buffer.
+      condition: A boolean Tensor controlling whether the tensors would be added
+        to the buffer or not.
+    Returns:
+      An add operation that adds the input `tensors` to the buffer. Similar to
+        a maybe_enqueue_op.
+    Raises:
+      ValueError: If the shapes and data types of input `tensors` are not the
+        same across calls to the add function.
+    """
+    if not isinstance(tensors, dict):
+      names = [str(i) for i in range(len(tensors))]
+      tensors = collections.OrderedDict(zip(names, tensors))
+    if not isinstance(tensors, collections.OrderedDict):
+      tensors = collections.OrderedDict(
+          sorted(tensors.items(), key=lambda t: t[0]))
+    if not self._tensors:
+      self._create_variables(tensors)
+    else:
+      self._validate(tensors)
+
+    #@tf.critical_section(self._position_mutex)
+    def _increment_num_adds():
+      # Adding 0 to the num_adds variable is a trick to read the value of the
+      # variable and return a read-only tensor. Doing this in a critical
+      # section allows us to capture a snapshot of the variable that will
+      # not be affected by other threads updating num_adds.
+      return self._num_adds.assign_add(1) + 0
+    def _add():
+      num_adds_inc = self._num_adds_cs.execute(_increment_num_adds)
+      current_pos = tf.mod(num_adds_inc - 1, self._buffer_size)
+      update_ops = []
+      for name in self._tensors.keys():
+        update_ops.append(
+            tf.scatter_update(self._tensors[name], current_pos, tensors[name]))
+      return tf.group(*update_ops)
+
+    return tf.contrib.framework.smart_cond(condition, _add, tf.no_op)
+
+  def get_random_batch(self, batch_size, keys=None, num_steps=1):
+    """Samples a batch of tensors from the buffer with replacement.
+
+    Args:
+      batch_size: (integer) number of elements to sample.
+      keys: List of keys of tensors to retrieve. If None retrieve all.
+      num_steps: (integer) length of trajectories to return. If > 1 will return
+        a list of lists, where each internal list represents a trajectory of
+        length num_steps.
+    Returns:
+      A list of tensors, where each element in the list is a batch sampled from
+        one of the tensors in the buffer.
+    Raises:
+      ValueError: If get_random_batch is called before calling the add function.
+      tf.errors.InvalidArgumentError: If this operation is executed before any
+        items are added to the buffer.
+    """
+    if not self._tensors:
+      raise ValueError('The add function must be called before get_random_batch.')
+    if keys is None:
+      keys = self._tensors.keys()
+
+    latest_start_index = self.get_num_adds() - num_steps + 1
+    empty_buffer_assert = tf.Assert(
+        tf.greater(latest_start_index, 0),
+        ['Not enough elements have been added to the buffer.'])
+    with tf.control_dependencies([empty_buffer_assert]):
+      max_index = tf.minimum(self._buffer_size, latest_start_index)
+      indices = tf.random_uniform(
+          [batch_size],
+          minval=0,
+          maxval=max_index,
+          dtype=tf.int64)
+      if num_steps == 1:
+        return self.gather(indices, keys)
+      else:
+        return self.gather_nstep(num_steps, indices, keys)
+
+  def gather(self, indices, keys=None):
+    """Returns elements at the specified indices from the buffer.
+
+    Args:
+      indices: (list of integers or rank 1 int Tensor) indices in the buffer to
+        retrieve elements from.
+      keys: List of keys of tensors to retrieve. If None retrieve all.
+    Returns:
+      A list of tensors, where each element in the list is obtained by indexing
+        one of the tensors in the buffer.
+    Raises:
+      ValueError: If gather is called before calling the add function.
+      tf.errors.InvalidArgumentError: If indices are bigger than the number of
+        items in the buffer.
+    """
+    if not self._tensors:
+      raise ValueError('The add function must be called before calling gather.')
+    if keys is None:
+      keys = self._tensors.keys()
+    with tf.name_scope('Gather'):
+      index_bound_assert = tf.Assert(
+          tf.less(
+              tf.to_int64(tf.reduce_max(indices)),
+              tf.minimum(self.get_num_adds(), self._buffer_size)),
+          ['Index out of bounds.'])
+      with tf.control_dependencies([index_bound_assert]):
+        indices = tf.convert_to_tensor(indices)
+
+      batch = []
+      for key in keys:
+        batch.append(tf.gather(self._tensors[key], indices, name=key))
+      return batch
+
+  def gather_nstep(self, num_steps, indices, keys=None):
+    """Returns trajectories starting at the specified indices from the buffer.
+
+    Args:
+      num_steps: (integer) length of trajectories to return.
+      indices: (rank 1 int Tensor) start indices in the buffer, one per
+        trajectory. Each trajectory covers num_steps consecutive positions,
+        wrapping around the end of the buffer.
+      keys: List of keys of tensors to retrieve. If None retrieve all.
+    Returns:
+      A list of tensors, one per key, each holding a batch of trajectories of
+        length num_steps gathered from the corresponding buffer tensor.
+    Raises:
+      ValueError: If gather_nstep is called before calling the add function.
+      tf.errors.InvalidArgumentError: If a trajectory extends past the number
+        of items added to the buffer.
+    """
+    if not self._tensors:
+      raise ValueError('The add function must be called before calling gather.')
+    if keys is None:
+      keys = self._tensors.keys()
+    with tf.name_scope('Gather'):
+      index_bound_assert = tf.Assert(
+          tf.less_equal(
+              tf.to_int64(tf.reduce_max(indices) + num_steps),
+              self.get_num_adds()),
+          ['Trajectory indices go out of bounds.'])
+      with tf.control_dependencies([index_bound_assert]):
+        indices = tf.map_fn(
+            lambda x: tf.mod(tf.range(x, x + num_steps), self._buffer_size),
+            indices,
+            dtype=tf.int64)
+
+      batch = []
+      for key in keys:
+
+        def SampleTrajectories(trajectory_indices, key=key,
+                               num_steps=num_steps):
+          trajectory_indices.set_shape([num_steps])
+          return tf.gather(self._tensors[key], trajectory_indices, name=key)
+
+        batch.append(tf.map_fn(SampleTrajectories, indices,
+                               dtype=self._tensors[key].dtype))
+      return batch
+
+  def get_position(self):
+    """Returns the position at which the last element was added.
+
+    Returns:
+      An int tensor representing the index at which the last element was added
+        to the buffer or -1 if no elements were added.
+    """
+    return tf.cond(self.get_num_adds() < 1,
+                   lambda: self.get_num_adds() - 1,
+                   lambda: tf.mod(self.get_num_adds() - 1, self._buffer_size))
+
+  def get_num_adds(self):
+    """Returns the number of additions to the buffer.
+
+    Returns:
+      An int tensor representing the number of elements that were added.
+    """
+    def num_adds():
+      return self._num_adds.value()
+
+    return self._num_adds_cs.execute(num_adds)
+
+  def get_num_tensors(self):
+    """Returns the number of tensors (slots) in the buffer."""
+    return len(self._tensors)
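The indexing logic of `CircularBuffer` above (write at `num_adds % buffer_size`, sample uniformly from the filled part) can be checked with a small plain-NumPy sketch. This is an illustration only: the class `NumpyCircularBuffer` and its single-array storage are hypothetical, and it leaves out the named tensors, the critical section, and the conditional add of the TensorFlow version.

```python
import numpy as np


class NumpyCircularBuffer(object):
  """Plain-NumPy sketch of the ring-buffer indexing (illustrative only)."""

  def __init__(self, buffer_size, item_shape):
    self.buffer_size = buffer_size
    self.data = np.zeros((buffer_size,) + tuple(item_shape), dtype=np.float32)
    self.num_adds = 0

  def add(self, item):
    # The write position wraps around once the buffer is full.
    self.data[self.num_adds % self.buffer_size] = item
    self.num_adds += 1

  def get_position(self):
    # Index of the most recent write, or -1 if nothing was added.
    if self.num_adds < 1:
      return -1
    return (self.num_adds - 1) % self.buffer_size

  def get_random_batch(self, batch_size, rng):
    if self.num_adds < 1:
      raise ValueError('Not enough elements have been added to the buffer.')
    # Only sample from slots that have been written at least once.
    max_index = min(self.buffer_size, self.num_adds)
    indices = rng.integers(0, max_index, size=batch_size)
    return self.data[indices]


buf = NumpyCircularBuffer(buffer_size=3, item_shape=(2,))
for step in range(5):
  buf.add([step, step])
batch = buf.get_random_batch(4, np.random.default_rng(0))
print(buf.data, buf.get_position(), batch.shape)
```

After five adds into a buffer of size three, the two oldest elements have been overwritten and the most recent write sits at index 1.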
--- a/research/efficient-hrl/agents/ddpg_agent.py
+++ b/research/efficient-hrl/agents/ddpg_agent.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""A DDPG/NAF agent.
+
+Implements the Deep Deterministic Policy Gradient (DDPG) algorithm from
+"Continuous control with deep reinforcement learning" - Lillicrap et al.
+https://arxiv.org/abs/1509.02971, and the Normalized Advantage Functions (NAF)
+algorithm "Continuous Deep Q-Learning with Model-based Acceleration" - Gu et al.
+https://arxiv.org/pdf/1603.00748.
+"""
+
+import tensorflow as tf
+slim = tf.contrib.slim
+import gin.tf
+from utils import utils
+from agents import ddpg_networks as networks
+
+
+@gin.configurable
+class DdpgAgent(object):
+  """An RL agent that learns using the DDPG algorithm.
+
+  Example usage:
+
+  def critic_net(states, actions):
+    ...
+  def actor_net(states, num_action_dims):
+    ...
+
+  Given a tensorflow environment tf_env,
+  (of type learning.deepmind.rl.environments.tensorflow.python.tfpyenvironment)
+
+  obs_spec = tf_env.observation_spec()
+  action_spec = tf_env.action_spec()
+
+  ddpg_agent = agent.DdpgAgent(obs_spec,
+                               action_spec,
+                               actor_net=actor_net,
+                               critic_net=critic_net)
+
+  we can perform actions on the environment as follows:
+
+  state = tf_env.observations()[0]
+  action = ddpg_agent.actor_net(tf.expand_dims(state, 0))[0, :]
+  transition_type, reward, discount = tf_env.step([action])
+
+  Train:
+
+  critic_loss = ddpg_agent.critic_loss(states, actions, rewards, discounts,
+                                       next_states)
+  actor_loss = ddpg_agent.actor_loss(states)
+
+  critic_train_op = slim.learning.create_train_op(
+      critic_loss,
+      critic_optimizer,
+      variables_to_train=ddpg_agent.get_trainable_critic_vars(),
+  )
+
+  actor_train_op = slim.learning.create_train_op(
+      actor_loss,
+      actor_optimizer,
+      variables_to_train=ddpg_agent.get_trainable_actor_vars(),
+  )
+  """
+
+  ACTOR_NET_SCOPE = 'actor_net'
+  CRITIC_NET_SCOPE = 'critic_net'
+  TARGET_ACTOR_NET_SCOPE = 'target_actor_net'
+  TARGET_CRITIC_NET_SCOPE = 'target_critic_net'
+
+  def __init__(self,
+               observation_spec,
+               action_spec,
+               actor_net=networks.actor_net,
+               critic_net=networks.critic_net,
+               td_errors_loss=tf.losses.huber_loss,
+               dqda_clipping=0.,
+               actions_regularizer=0.,
+               target_q_clipping=None,
+               residual_phi=0.0,
+               debug_summaries=False):
+    """Constructs a DDPG agent.
+
+    Args:
+      observation_spec: A TensorSpec defining the observations.
+      action_spec: A BoundedTensorSpec defining the actions.
+      actor_net: A callable that creates the actor network. Must take the
+        following arguments: states, action_spec. Please see networks.actor_net
+        for an example.
+      critic_net: A callable that creates the critic network. Must take the
+        following arguments: states, actions. Please see networks.critic_net
+        for an example.
+      td_errors_loss: A callable defining the loss function for the critic
+        td error.
+      dqda_clipping: (float) clips the gradient dqda element-wise between
+        [-dqda_clipping, dqda_clipping]. Does not perform clipping if
+        dqda_clipping == 0.
+      actions_regularizer: A scalar, when positive penalizes the norm of the
+        actions. This can prevent saturation of actions for the actor_loss.
+      target_q_clipping: (tuple of floats) clips target q values within
+        (low, high) values when computing the critic loss.
+      residual_phi: (float) [0.0, 1.0] Residual algorithm parameter that
+        interpolates between Q-learning and residual gradient algorithm.
+        http://www.leemon.com/papers/1995b.pdf
+      debug_summaries: If True, add summaries to help debug behavior.
+    Raises:
+      ValueError: If 'dqda_clipping' is < 0.
+    """
+    self._observation_spec = observation_spec[0]
+    self._action_spec = action_spec[0]
+    self._state_shape = tf.TensorShape([None]).concatenate(
+        self._observation_spec.shape)
+    self._action_shape = tf.TensorShape([None]).concatenate(
+        self._action_spec.shape)
+    self._num_action_dims = self._action_spec.shape.num_elements()
+
+    self._scope = tf.get_variable_scope().name
+    self._actor_net = tf.make_template(
+        self.ACTOR_NET_SCOPE, actor_net, create_scope_now_=True)
+    self._critic_net = tf.make_template(
+        self.CRITIC_NET_SCOPE, critic_net, create_scope_now_=True)
+    self._target_actor_net = tf.make_template(
+        self.TARGET_ACTOR_NET_SCOPE, actor_net, create_scope_now_=True)
+    self._target_critic_net = tf.make_template(
+        self.TARGET_CRITIC_NET_SCOPE, critic_net, create_scope_now_=True)
+    self._td_errors_loss = td_errors_loss
+    if dqda_clipping < 0:
+      raise ValueError('dqda_clipping must be >= 0.')
+    self._dqda_clipping = dqda_clipping
+    self._actions_regularizer = actions_regularizer
+    self._target_q_clipping = target_q_clipping
+    self._residual_phi = residual_phi
+    self._debug_summaries = debug_summaries
+
+  def _batch_state(self, state):
+    """Convert state to a batched state.
+
+    Args:
+      state: Either a state tensor [num_state_dims] or a list/tuple whose first
+        element is such a tensor.
+    Returns:
+      A tensor [1, num_state_dims]
+    """
+    if isinstance(state, (tuple, list)):
+      state = state[0]
+    if state.get_shape().ndims == 1:
+      state = tf.expand_dims(state, 0)
+    return state
+
+  def action(self, state):
+    """Returns the next action for the state.
+
+    Args:
+      state: A [num_state_dims] tensor representing a state.
+    Returns:
+      A [num_action_dims] tensor representing the action.
+    """
+    return self.actor_net(self._batch_state(state), stop_gradients=True)[0, :]
+
+  @gin.configurable('ddpg_sample_action')
+  def sample_action(self, state, stddev=1.0):
+    """Returns the action for the state with additive noise.
+
+    Args:
+      state: A [num_state_dims] tensor representing a state.
+      stddev: standard deviation of the additive Gaussian noise.
+    Returns:
+      A [num_action_dims] action tensor.
+    """
+    agent_action = self.action(state)
+    agent_action += tf.random_normal(tf.shape(agent_action)) * stddev
+    return utils.clip_to_spec(agent_action, self._action_spec)
+
+  def actor_net(self, states, stop_gradients=False):
+    """Returns the output of the actor network.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      stop_gradients: (boolean) if true, gradients cannot be propagated through
+        this operation.
+    Returns:
+      A [batch_size, num_action_dims] tensor of actions.
+    Raises:
+      ValueError: If `states` does not have the expected dimensions.
+    """
+    self._validate_states(states)
+    actions = self._actor_net(states, self._action_spec)
+    if stop_gradients:
+      actions = tf.stop_gradient(actions)
+    return actions
+
+  def critic_net(self, states, actions, for_critic_loss=False):
+    """Returns the output of the critic network.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      actions: A [batch_size, num_action_dims] tensor representing a batch
+        of actions.
+    Returns:
+      q values: A [batch_size] tensor of q values.
+    Raises:
+      ValueError: If `states` or `actions` do not have the expected dimensions.
+    """
+    self._validate_states(states)
+    self._validate_actions(actions)
+    return self._critic_net(states, actions,
+                            for_critic_loss=for_critic_loss)
+
+  def target_actor_net(self, states):
+    """Returns the output of the target actor network.
+
+    The target network is used to compute stable targets for training.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+    Returns:
+      A [batch_size, num_action_dims] tensor of actions.
+    Raises:
+      ValueError: If `states` does not have the expected dimensions.
+    """
+    self._validate_states(states)
+    actions = self._target_actor_net(states, self._action_spec)
+    return tf.stop_gradient(actions)
+
+  def target_critic_net(self, states, actions, for_critic_loss=False):
+    """Returns the output of the target critic network.
+
+    The target network is used to compute stable targets for training.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      actions: A [batch_size, num_action_dims] tensor representing a batch
+        of actions.
+    Returns:
+      q values: A [batch_size] tensor of q values.
+    Raises:
+      ValueError: If `states` or `actions` do not have the expected dimensions.
+    """
+    self._validate_states(states)
+    self._validate_actions(actions)
+    return tf.stop_gradient(
+        self._target_critic_net(states, actions,
+                                for_critic_loss=for_critic_loss))
+
+  def value_net(self, states, for_critic_loss=False):
+    """Returns the output of the critic evaluated with the actor.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+    Returns:
+      q values: A [batch_size] tensor of q values.
+    """
+    actions = self.actor_net(states)
+    return self.critic_net(states, actions,
+                           for_critic_loss=for_critic_loss)
+
+  def target_value_net(self, states, for_critic_loss=False):
+    """Returns the output of the target critic evaluated with the target actor.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+    Returns:
+      q values: A [batch_size] tensor of q values.
+    """
+    target_actions = self.target_actor_net(states)
+    return self.target_critic_net(states, target_actions,
+                                  for_critic_loss=for_critic_loss)
+
+  def critic_loss(self, states, actions, rewards, discounts,
+                  next_states):
+    """Computes a loss for training the critic network.
+
+    The loss is `td_errors_loss` (Huber loss by default) between the Q value
+    predictions of the critic and one-step TD targets computed with the target
+    networks.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      actions: A [batch_size, num_action_dims] tensor representing a batch
+        of actions.
+      rewards: A [batch_size, ...] tensor representing a batch of rewards,
+        broadcastable to the critic net output.
+      discounts: A [batch_size, ...] tensor representing a batch of discounts,
+        broadcastable to the critic net output.
+      next_states: A [batch_size, num_state_dims] tensor representing a batch
+        of next states.
+    Returns:
+      A rank-0 tensor representing the critic loss.
+    Raises:
+      ValueError: If any of the inputs do not have the expected dimensions, or
+        if their batch_sizes do not match.
+    """
+    self._validate_states(states)
+    self._validate_actions(actions)
+    self._validate_states(next_states)
+
+    target_q_values = self.target_value_net(next_states, for_critic_loss=True)
+    td_targets = target_q_values * discounts + rewards
+    if self._target_q_clipping is not None:
+      td_targets = tf.clip_by_value(td_targets, self._target_q_clipping[0],
+                                    self._target_q_clipping[1])
+    q_values = self.critic_net(states, actions, for_critic_loss=True)
+    td_errors = td_targets - q_values
+    if self._debug_summaries:
+      gen_debug_td_error_summaries(
+          target_q_values, q_values, td_targets, td_errors)
+
+    loss = self._td_errors_loss(td_targets, q_values)
+
+    if self._residual_phi > 0.0:  # compute residual gradient loss
+      residual_q_values = self.value_net(next_states, for_critic_loss=True)
+      residual_td_targets = residual_q_values * discounts + rewards
+      if self._target_q_clipping is not None:
+        residual_td_targets = tf.clip_by_value(residual_td_targets,
+                                               self._target_q_clipping[0],
+                                               self._target_q_clipping[1])
+      residual_td_errors = residual_td_targets - q_values
+      # Residual gradient: compare the bootstrapped target with Q(s, a).
+      residual_loss = self._td_errors_loss(
+          residual_td_targets, q_values)
+      loss = (loss * (1.0 - self._residual_phi) +
+              residual_loss * self._residual_phi)
+    return loss
+
+  def actor_loss(self, states):
+    """Computes a loss for training the actor network.
+
+    Note that output does not represent an actual loss. It is called a loss only
+    in the sense that its gradient w.r.t. the actor network weights is the
+    correct gradient for training the actor network,
+    i.e. dloss/dweights = (dq/da)*(da/dweights)
+    which is the gradient used in Algorithm 1 of Lillicrap et al.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+    Returns:
+      A rank-0 tensor representing the actor loss.
+    Raises:
+      ValueError: If `states` does not have the expected dimensions.
+    """
+    self._validate_states(states)
+    actions = self.actor_net(states, stop_gradients=False)
+    critic_values = self.critic_net(states, actions)
+    q_values = self.critic_function(critic_values, states)
+    dqda = tf.gradients([q_values], [actions])[0]
+    dqda_unclipped = dqda
+    if self._dqda_clipping > 0:
+      dqda = tf.clip_by_value(dqda, -self._dqda_clipping, self._dqda_clipping)
+
+    actions_norm = tf.norm(actions)
+    if self._debug_summaries:
+      with tf.name_scope('dqda'):
+        tf.summary.scalar('actions_norm', actions_norm)
+        tf.summary.histogram('dqda', dqda)
+        tf.summary.histogram('dqda_unclipped', dqda_unclipped)
+        tf.summary.histogram('actions', actions)
+        for a in range(self._num_action_dims):
+          tf.summary.histogram('dqda_unclipped_%d' % a, dqda_unclipped[:, a])
+          tf.summary.histogram('dqda_%d' % a, dqda[:, a])
+
+    actions_norm *= self._actions_regularizer
+    return slim.losses.mean_squared_error(tf.stop_gradient(dqda + actions),
+                                          actions,
+                                          scope='actor_loss') + actions_norm
+
+  @gin.configurable('ddpg_critic_function')
+  def critic_function(self, critic_values, states, weights=None):
+    """Computes q values based on critic_net outputs, states, and weights.
+
+    Args:
+      critic_values: A tf.float32 [batch_size, ...] tensor representing outputs
+        from the critic net.
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      weights: A list or Numpy array or tensor with a shape broadcastable to
+        `critic_values`.
+    Returns:
+      A tf.float32 [batch_size] tensor representing q values.
+    """
+    del states  # unused args
+    if weights is not None:
+      weights = tf.convert_to_tensor(weights, dtype=critic_values.dtype)
+      critic_values *= weights
+    if critic_values.shape.ndims > 1:
+      critic_values = tf.reduce_sum(critic_values,
+                                    range(1, critic_values.shape.ndims))
+    critic_values.shape.assert_has_rank(1)
+    return critic_values
+
+  @gin.configurable('ddpg_update_targets')
+  def update_targets(self, tau=1.0):
+    """Performs a soft update of the target network parameters.
+
+    For each weight w_s in the actor/critic networks, and its corresponding
+    weight w_t in the target actor/critic networks, a soft update is:
+    w_t = (1 - tau) * w_t + tau * w_s
+
+    Args:
+      tau: A float scalar in [0, 1]
+    Returns:
+      An operation that performs a soft update of the target network parameters.
+    Raises:
+      ValueError: If `tau` is not in [0, 1].
+    """
+    if tau < 0 or tau > 1:
+      raise ValueError('Input `tau` should be in [0, 1].')
+    update_actor = utils.soft_variables_update(
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.ACTOR_NET_SCOPE)),
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.TARGET_ACTOR_NET_SCOPE)),
+        tau)
+    update_critic = utils.soft_variables_update(
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.CRITIC_NET_SCOPE)),
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.TARGET_CRITIC_NET_SCOPE)),
+        tau)
+    return tf.group(update_actor, update_critic, name='update_targets')
+
+  def get_trainable_critic_vars(self):
+    """Returns a list of trainable variables in the critic network.
+
+    Returns:
+      A list of trainable variables in the critic network.
+    """
+    return slim.get_trainable_variables(
+        utils.join_scope(self._scope, self.CRITIC_NET_SCOPE))
+
+  def get_trainable_actor_vars(self):
+    """Returns a list of trainable variables in the actor network.
+
+    Returns:
+      A list of trainable variables in the actor network.
+    """
+    return slim.get_trainable_variables(
+        utils.join_scope(self._scope, self.ACTOR_NET_SCOPE))
+
+  def get_critic_vars(self):
+    """Returns a list of all variables in the critic network.
+
+    Returns:
+      A list of all model variables in the critic network.
+    """
+    return slim.get_model_variables(
+        utils.join_scope(self._scope, self.CRITIC_NET_SCOPE))
+
+  def get_actor_vars(self):
+    """Returns a list of all variables in the actor network.
+
+    Returns:
+      A list of all model variables in the actor network.
+    """
+    return slim.get_model_variables(
+        utils.join_scope(self._scope, self.ACTOR_NET_SCOPE))
+
+  def _validate_states(self, states):
+    """Raises a value error if `states` does not have the expected shape.
+
+    Args:
+      states: A tensor.
+    Raises:
+      ValueError: If states.shape or states.dtype are not compatible with
+        observation_spec.
+    """
+    states.shape.assert_is_compatible_with(self._state_shape)
+    if not states.dtype.is_compatible_with(self._observation_spec.dtype):
+      raise ValueError('states.dtype={} is not compatible with'
+                       ' observation_spec.dtype={}'.format(
+                           states.dtype, self._observation_spec.dtype))
+
+  def _validate_actions(self, actions):
+    """Raises an error if `actions` does not have the expected shape or dtype.
+
+    Args:
+      actions: A tensor.
+    Raises:
+      ValueError: If actions.shape or actions.dtype are not compatible with
+        action_spec.
+    """
+    actions.shape.assert_is_compatible_with(self._action_shape)
+    if not actions.dtype.is_compatible_with(self._action_spec.dtype):
+      raise ValueError('actions.dtype={} is not compatible with'
+                       ' action_spec.dtype={}'.format(
+                           actions.dtype, self._action_spec.dtype))
+
+
+@gin.configurable
+class TD3Agent(DdpgAgent):
+  """An RL agent that learns using the TD3 algorithm."""
+
+  ACTOR_NET_SCOPE = 'actor_net'
+  CRITIC_NET_SCOPE = 'critic_net'
+  CRITIC_NET2_SCOPE = 'critic_net2'
+  TARGET_ACTOR_NET_SCOPE = 'target_actor_net'
+  TARGET_CRITIC_NET_SCOPE = 'target_critic_net'
+  TARGET_CRITIC_NET2_SCOPE = 'target_critic_net2'
+
+  def __init__(self,
+               observation_spec,
+               action_spec,
+               actor_net=networks.actor_net,
+               critic_net=networks.critic_net,
+               td_errors_loss=tf.losses.huber_loss,
+               dqda_clipping=0.,
+               actions_regularizer=0.,
+               target_q_clipping=None,
+               residual_phi=0.0,
+               debug_summaries=False):
+    """Constructs a TD3 agent.
+
+    Args:
+      observation_spec: A TensorSpec defining the observations.
+      action_spec: A BoundedTensorSpec defining the actions.
+      actor_net: A callable that creates the actor network. Must take the
+        following arguments: states, num_actions. Please see networks.actor_net
+        for an example.
+      critic_net: A callable that creates the critic network. Must take the
+        following arguments: states, actions. Please see networks.critic_net
+        for an example.
+      td_errors_loss: A callable defining the loss function for the critic
+        td error.
+      dqda_clipping: (float) clips the gradient dqda element-wise between
+        [-dqda_clipping, dqda_clipping]. Does not perform clipping if
+        dqda_clipping == 0.
+      actions_regularizer: A scalar, when positive penalizes the norm of the
+        actions. This can prevent saturation of actions for the actor_loss.
+      target_q_clipping: (tuple of floats) clips target q values within
+        (low, high) values when computing the critic loss.
+      residual_phi: (float) [0.0, 1.0] Residual algorithm parameter that
+        interpolates between Q-learning and residual gradient algorithm.
+        http://www.leemon.com/papers/1995b.pdf
+      debug_summaries: If True, add summaries to help debug behavior.
+    Raises:
+      ValueError: If 'dqda_clipping' is < 0.
+    """
+    self._observation_spec = observation_spec[0]
+    self._action_spec = action_spec[0]
+    self._state_shape = tf.TensorShape([None]).concatenate(
+        self._observation_spec.shape)
+    self._action_shape = tf.TensorShape([None]).concatenate(
+        self._action_spec.shape)
+    self._num_action_dims = self._action_spec.shape.num_elements()
+
+    self._scope = tf.get_variable_scope().name
+    self._actor_net = tf.make_template(
+        self.ACTOR_NET_SCOPE, actor_net, create_scope_now_=True)
+    self._critic_net = tf.make_template(
+        self.CRITIC_NET_SCOPE, critic_net, create_scope_now_=True)
+    self._critic_net2 = tf.make_template(
+        self.CRITIC_NET2_SCOPE, critic_net, create_scope_now_=True)
+    self._target_actor_net = tf.make_template(
+        self.TARGET_ACTOR_NET_SCOPE, actor_net, create_scope_now_=True)
+    self._target_critic_net = tf.make_template(
+        self.TARGET_CRITIC_NET_SCOPE, critic_net, create_scope_now_=True)
+    self._target_critic_net2 = tf.make_template(
+        self.TARGET_CRITIC_NET2_SCOPE, critic_net, create_scope_now_=True)
+    self._td_errors_loss = td_errors_loss
+    if dqda_clipping < 0:
+      raise ValueError('dqda_clipping must be >= 0.')
+    self._dqda_clipping = dqda_clipping
+    self._actions_regularizer = actions_regularizer
+    self._target_q_clipping = target_q_clipping
+    self._residual_phi = residual_phi
+    self._debug_summaries = debug_summaries
+
+  def get_trainable_critic_vars(self):
+    """Returns a list of trainable variables in both critic networks.
+
+    NOTE: The scope is matched as a prefix, so CRITIC_NET_SCOPE also picks up
+    the variables under CRITIC_NET2_SCOPE.
+
+    Returns:
+      A list of trainable variables in the critic networks.
+    """
+    return (
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.CRITIC_NET_SCOPE)))
+
+  def critic_net(self, states, actions, for_critic_loss=False):
+    """Returns the output of the critic network.
+
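+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      actions: A [batch_size, num_action_dims] tensor representing a batch
+        of actions.
+      for_critic_loss: If True, return the outputs of both critic networks.
+    Returns:
+      q values: A [batch_size] tensor of q values from the first critic, or
+        a (values1, values2) tuple from both critics if for_critic_loss is
+        True.
+    Raises:
+      ValueError: If `states` or `actions` do not have the expected dimensions.
+    """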
+    values1 = self._critic_net(states, actions,
+                               for_critic_loss=for_critic_loss)
+    values2 = self._critic_net2(states, actions,
+                                for_critic_loss=for_critic_loss)
+    if for_critic_loss:
+      return values1, values2
+    return values1
+
+  def target_critic_net(self, states, actions, for_critic_loss=False):
+    """Returns the output of the target critic network.
+
+    The target network is used to compute stable targets for training.
+
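+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      actions: A [batch_size, num_action_dims] tensor representing a batch
+        of actions.
+      for_critic_loss: If True, return the outputs of both target critic
+        networks.
+    Returns:
+      q values: A [batch_size] tensor of q values from the first target
+        critic, or a (values1, values2) tuple from both target critics if
+        for_critic_loss is True. Gradients are stopped in either case.
+    Raises:
+      ValueError: If `states` or `actions` do not have the expected dimensions.
+    """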
+    self._validate_states(states)
+    self._validate_actions(actions)
+    values1 = tf.stop_gradient(
+        self._target_critic_net(states, actions,
+                                for_critic_loss=for_critic_loss))
+    values2 = tf.stop_gradient(
+        self._target_critic_net2(states, actions,
+                                 for_critic_loss=for_critic_loss))
+    if for_critic_loss:
+      return values1, values2
+    return values1
+
+  def value_net(self, states, for_critic_loss=False):
+    """Returns the output of the critic evaluated with the actor.
+
+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      for_critic_loss: Passed through to critic_net.
+    Returns:
+      q values: The output of critic_net for the actor's actions.
+    """
+    actions = self.actor_net(states)
+    return self.critic_net(states, actions,
+                           for_critic_loss=for_critic_loss)
+
+  def target_value_net(self, states, for_critic_loss=False):
+    """Returns the output of the target critic evaluated with the target actor.
+
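+    Args:
+      states: A [batch_size, num_state_dims] tensor representing a batch
+        of states.
+      for_critic_loss: Passed through to target_critic_net.
+    Returns:
+      q values: A (values, values) tuple in which both entries are the
+        element-wise minimum of the two target critics' q values, evaluated
+        at the target actions plus clipped Gaussian noise.
+    """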
+    target_actions = self.target_actor_net(states)
+    noise = tf.clip_by_value(
+        tf.random_normal(tf.shape(target_actions), stddev=0.2), -0.5, 0.5)
+    values1, values2 = self.target_critic_net(
+        states, target_actions + noise,
+        for_critic_loss=for_critic_loss)
+    values = tf.minimum(values1, values2)
+    return values, values
+
+  @gin.configurable('td3_update_targets')
+  def update_targets(self, tau=1.0):
+    """Performs a soft update of the target network parameters.
+
+    For each weight w_s in the actor/critic networks, and its corresponding
+    weight w_t in the target actor/critic networks, a soft update is:
+    w_t = (1 - tau) * w_t + tau * w_s
+
+    Args:
+      tau: A float scalar in [0, 1]
+    Returns:
+      An operation that performs a soft update of the target network parameters.
+    Raises:
+      ValueError: If `tau` is not in [0, 1].
+    """
+    if tau < 0 or tau > 1:
+      raise ValueError('Input `tau` should be in [0, 1].')
+    update_actor = utils.soft_variables_update(
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.ACTOR_NET_SCOPE)),
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.TARGET_ACTOR_NET_SCOPE)),
+        tau)
+    # NOTE: This updates both critic networks.
+    update_critic = utils.soft_variables_update(
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.CRITIC_NET_SCOPE)),
+        slim.get_trainable_variables(
+            utils.join_scope(self._scope, self.TARGET_CRITIC_NET_SCOPE)),
+        tau)
+    return tf.group(update_actor, update_critic, name='update_targets')
+
+
+def gen_debug_td_error_summaries(
+    target_q_values, q_values, td_targets, td_errors):
+  """Generates debug summaries for critic given a set of batch samples.
+
+  Args:
+    target_q_values: set of predicted next stage values.
+    q_values: current predicted value for the critic network.
+    td_targets: discounted target_q_values with added next stage reward.
+    td_errors: the difference between td_targets and q_values.
+  """
+  with tf.name_scope('td_errors'):
+    tf.summary.histogram('td_targets', td_targets)
+    tf.summary.histogram('q_values', q_values)
+    tf.summary.histogram('target_q_values', target_q_values)
+    tf.summary.histogram('td_errors', td_errors)
+    with tf.name_scope('td_targets'):
+      tf.summary.scalar('mean', tf.reduce_mean(td_targets))
+      tf.summary.scalar('max', tf.reduce_max(td_targets))
+      tf.summary.scalar('min', tf.reduce_min(td_targets))
+    with tf.name_scope('q_values'):
+      tf.summary.scalar('mean', tf.reduce_mean(q_values))
+      tf.summary.scalar('max', tf.reduce_max(q_values))
+      tf.summary.scalar('min', tf.reduce_min(q_values))
+    with tf.name_scope('target_q_values'):
+      tf.summary.scalar('mean', tf.reduce_mean(target_q_values))
+      tf.summary.scalar('max', tf.reduce_max(target_q_values))
+      tf.summary.scalar('min', tf.reduce_min(target_q_values))
+    with tf.name_scope('td_errors'):
+      tf.summary.scalar('mean', tf.reduce_mean(td_errors))
+      tf.summary.scalar('max', tf.reduce_max(td_errors))
+      tf.summary.scalar('min', tf.reduce_min(td_errors))
+      tf.summary.scalar('mean_abs', tf.reduce_mean(tf.abs(td_errors)))
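
The target-network logic in `TD3Agent` above can be summarized without TensorFlow. The following is a minimal sketch in plain Python, not the code from this commit: the function names `soft_update`, `smoothed_target_action`, and `clipped_double_q` are hypothetical, and lists of floats stand in for tensors. It mirrors the soft update `w_t = (1 - tau) * w_t + tau * w_s`, the clipped Gaussian noise added to the target actions (stddev 0.2, clipped to [-0.5, 0.5]), and the element-wise minimum over the two target critics.

```python
import random


def soft_update(source, target, tau):
  """Returns target weights moved a fraction tau toward the source weights."""
  if tau < 0 or tau > 1:
    raise ValueError('Input `tau` should be in [0, 1].')
  return [(1 - tau) * w_t + tau * w_s for w_s, w_t in zip(source, target)]


def smoothed_target_action(action, stddev=0.2, clip=0.5, rng=random):
  """Adds Gaussian noise, clipped to [-clip, clip], to each action dimension."""
  return [a + max(-clip, min(clip, rng.gauss(0.0, stddev))) for a in action]


def clipped_double_q(q1, q2):
  """Takes the element-wise minimum of two critics' q value estimates."""
  return [min(a, b) for a, b in zip(q1, q2)]


# Hypothetical usage: tau=1.0 copies the source weights outright, while a
# small tau (e.g. the 0.005 bound to td3_update_targets.tau) moves slowly.
new_target = soft_update([1.0, 2.0], [0.0, 0.0], tau=0.5)
```

With `tau=1.0` the update degenerates to a hard copy of the source weights, which is why the config binds a much smaller value for training.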
--- a/research/efficient-hrl/agents/ddpg_networks.py
+++ b/research/efficient-hrl/agents/ddpg_networks.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Sample actor(policy) and critic(q) networks to use with DDPG/NAF agents.
+
+The DDPG networks are defined in "Section 7: Experiment Details" of
+"Continuous control with deep reinforcement learning" - Lillicrap et al.
+https://arxiv.org/abs/1509.02971
+
+The NAF critic network is based on "Section 4" of "Continuous deep Q-learning
+with model-based acceleration" - Gu et al. https://arxiv.org/pdf/1603.00748.
+"""
+
+import tensorflow as tf
+slim = tf.contrib.slim
+import gin.tf
+
+
+@gin.configurable('ddpg_critic_net')
+def critic_net(states, actions,
+               for_critic_loss=False,
+               num_reward_dims=1,
+               states_hidden_layers=(400,),
+               actions_hidden_layers=None,
+               joint_hidden_layers=(300,),
+               weight_decay=0.0001,
+               normalizer_fn=None,
+               activation_fn=tf.nn.relu,
+               zero_obs=False,
+               images=False):
+  """Creates a critic that returns q values for the given states and actions.
+
+  Args:
+    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
+      representing a batch of states.
+    actions: (castable to tf.float32) a [batch_size, num_action_dims] tensor
+      representing a batch of actions.
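+    for_critic_loss: If True and num_reward_dims > 1, return the
+      [batch_size, num_reward_dims] values without reducing them.
+    num_reward_dims: Number of reward dimensions.
+    states_hidden_layers: tuple of hidden layers units for states.
+    actions_hidden_layers: tuple of hidden layers units for actions.
+    joint_hidden_layers: tuple of hidden layers units after joining states
+      and actions using tf.concat().
+    weight_decay: Weight decay for l2 weights regularizer.
+    normalizer_fn: Normalizer function, e.g. slim.layer_norm.
+    activation_fn: Activation function, e.g. tf.nn.relu, slim.leaky_relu, ...
+    zero_obs: If True, zero out the first two state dimensions.
+    images: If True, zero out the first two state dimensions.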
+  Returns:
+    A tf.float32 [batch_size] tensor of q values, or a tf.float32
+      [batch_size, num_reward_dims] tensor of vector q values if
+      num_reward_dims > 1.
+  """
+  with slim.arg_scope(
+      [slim.fully_connected],
+      activation_fn=activation_fn,
+      normalizer_fn=normalizer_fn,
+      weights_regularizer=slim.l2_regularizer(weight_decay),
+      weights_initializer=slim.variance_scaling_initializer(
+          factor=1.0/3.0, mode='FAN_IN', uniform=True)):
+
+    orig_states = tf.to_float(states)
+    # Concatenate states and actions before the 'states' stack (TD3).
+    states = tf.concat([tf.to_float(states), tf.to_float(actions)], -1)
+    if images or zero_obs:  # Zero-out x, y position. Hacky.
+      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))
+    actions = tf.to_float(actions)
+    if states_hidden_layers:
+      states = slim.stack(states, slim.fully_connected, states_hidden_layers,
+                          scope='states')
+    if actions_hidden_layers:
+      actions = slim.stack(actions, slim.fully_connected, actions_hidden_layers,
+                           scope='actions')
+    joint = tf.concat([states, actions], 1)
+    if joint_hidden_layers:
+      joint = slim.stack(joint, slim.fully_connected, joint_hidden_layers,
+                         scope='joint')
+    with slim.arg_scope([slim.fully_connected],
+                        weights_regularizer=None,
+                        weights_initializer=tf.random_uniform_initializer(
+                            minval=-0.003, maxval=0.003)):
+      value = slim.fully_connected(joint, num_reward_dims,
+                                   activation_fn=None,
+                                   normalizer_fn=None,
+                                   scope='q_value')
+    if num_reward_dims == 1:
+      value = tf.reshape(value, [-1])
+    if not for_critic_loss and num_reward_dims > 1:
+      value = tf.reduce_sum(
+          value * tf.abs(orig_states[:, -num_reward_dims:]), -1)
+  return value
+
+
+@gin.configurable('ddpg_actor_net')
+def actor_net(states, action_spec,
+              hidden_layers=(400, 300),
+              normalizer_fn=None,
+              activation_fn=tf.nn.relu,
+              zero_obs=False,
+              images=False):
+  """Creates an actor that returns actions for the given states.
+
+  Args:
+    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
+      representing a batch of states.
+    action_spec: (BoundedTensorSpec) A tensor spec indicating the shape
+      and range of actions.
+    hidden_layers: tuple of hidden layers units.
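+    normalizer_fn: Normalizer function, e.g. slim.layer_norm.
+    activation_fn: Activation function, e.g. tf.nn.relu, slim.leaky_relu, ...
+    zero_obs: If True, zero out the x, y position in the states.
+    images: If True, zero out the x, y position in the states.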
+  Returns:
+    A tf.float32 [batch_size, num_action_dims] tensor of actions.
+  """
+
+  with slim.arg_scope(
+      [slim.fully_connected],
+      activation_fn=activation_fn,
+      normalizer_fn=normalizer_fn,
+      weights_initializer=slim.variance_scaling_initializer(
+          factor=1.0/3.0, mode='FAN_IN', uniform=True)):
+
+    states = tf.to_float(states)
+    orig_states = states
+    if images or zero_obs:  # Zero-out x, y position. Hacky.
+      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))
+    if hidden_layers:
+      states = slim.stack(states, slim.fully_connected, hidden_layers,
+                          scope='states')
+    with slim.arg_scope([slim.fully_connected],
+                        weights_initializer=tf.random_uniform_initializer(
+                            minval=-0.003, maxval=0.003)):
+      actions = slim.fully_connected(states,
+                                     action_spec.shape.num_elements(),
+                                     scope='actions',
+                                     normalizer_fn=None,
+                                     activation_fn=tf.nn.tanh)
+      action_means = (action_spec.maximum + action_spec.minimum) / 2.0
+      action_magnitudes = (action_spec.maximum - action_spec.minimum) / 2.0
+      actions = action_means + action_magnitudes * actions
+
+  return actions
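
The last few lines of `actor_net` map the tanh output into the action bounds, and the `zero_obs`/`images` branch masks the first two state dimensions. A minimal plain-Python sketch of both steps follows; the helper names `scale_action` and `zero_xy` are hypothetical, and scalars and lists stand in for tensors and for `action_spec.minimum`/`action_spec.maximum`.

```python
import math


def scale_action(pre_activation, minimum, maximum):
  """Maps an unbounded network output into [minimum, maximum] via tanh."""
  action_mean = (maximum + minimum) / 2.0
  action_magnitude = (maximum - minimum) / 2.0
  return action_mean + action_magnitude * math.tanh(pre_activation)


def zero_xy(state):
  """Zeroes the first two entries of a state, keeping the rest unchanged."""
  mask = [0.0] * 2 + [1.0] * (len(state) - 2)
  return [m * s for m, s in zip(mask, state)]


# Hypothetical usage with bounds [-2, 4]: a zero pre-activation lands on the
# midpoint of the range.
midpoint = scale_action(0.0, -2.0, 4.0)
```

Because tanh saturates at -1 and 1, the scaled action can approach but never exceed the bounds, so no separate clipping step is needed for the actor output.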
--- a/research/efficient-hrl/cond_fn.py
+++ b/research/efficient-hrl/cond_fn.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Defines many boolean functions indicating when to step and reset.
+"""
+
+import tensorflow as tf
+import gin.tf
+
+
+@gin.configurable
+def env_transition(agent, state, action, transition_type, environment_steps,
+                   num_episodes):
+  """True if the transition_type is TRANSITION or FINAL_TRANSITION.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+  Returns:
+    cond: Returns an op that evaluates to true if the transition type is
+    not RESTARTING
+  """
+  del agent, state, action, num_episodes, environment_steps
+  cond = tf.logical_not(transition_type)
+  return cond
+
+
+@gin.configurable
+def env_restart(agent, state, action, transition_type, environment_steps,
+                num_episodes):
+  """True if the transition_type is RESTARTING.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+  Returns:
+    cond: Returns an op that evaluates to true if the transition type equals
+    RESTARTING.
+  """
+  del agent, state, action, num_episodes, environment_steps
+  cond = tf.identity(transition_type)
+  return cond
+
+
+@gin.configurable
+def every_n_steps(agent,
+                  state,
+                  action,
+                  transition_type,
+                  environment_steps,
+                  num_episodes,
+                  n=150):
+  """True once every n steps.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+    n: Return true once every n steps.
+  Returns:
+    cond: Returns an op that evaluates to true if environment_steps
+    equals 0 mod n. We increment the step before checking this condition, so
+    we do not need to add one to environment_steps.
+  """
+  del agent, state, action, transition_type, num_episodes
+  cond = tf.equal(tf.mod(environment_steps, n), 0)
+  return cond
+
+
+@gin.configurable
+def every_n_episodes(agent,
+                     state,
+                     action,
+                     transition_type,
+                     environment_steps,
+                     num_episodes,
+                     n=2,
+                     steps_per_episode=None):
+  """True once every n episodes.
+
+  Specifically, evaluates to True on the 0th step of every nth episode.
+  Unlike environment_steps, num_episodes starts at 0, so we do want to add
+  one to ensure it does not reset on the first call.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+    n: Return true once every n episodes.
+    steps_per_episode: How many steps per episode. Needed to determine when a
+    new episode starts.
+  Returns:
+    cond: Returns an op that evaluates to true on the 0th step of an episode
+      if num_episodes + 1 equals 0 mod n or the ant has fallen.
+  """
+  assert steps_per_episode is not None
+  del agent, action, transition_type
+  ant_fell = tf.logical_or(state[2] < 0.2, state[2] > 1.0)
+  cond = tf.logical_and(
+      tf.logical_or(
+          ant_fell,
+          tf.equal(tf.mod(num_episodes + 1, n), 0)),
+      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
+  return cond
+
+
+@gin.configurable
+def failed_reset_after_n_episodes(agent,
+                                  state,
+                                  action,
+                                  transition_type,
+                                  environment_steps,
+                                  num_episodes,
+                                  steps_per_episode=None,
+                                  reset_state=None,
+                                  max_dist=1.0,
+                                  epsilon=1e-10):
+  """Every n episodes, returns True if the reset agent fails to return.
+
+  Specifically, evaluates to True if the distance between the state and the
+  reset state is greater than max_dist at the end of the episode.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+    steps_per_episode: How many steps per episode. Needed to determine when a
+    new episode starts.
+    reset_state: State to which the reset controller should return.
+    max_dist: Agent is considered to have successfully reset if its distance
+    from the reset_state is less than max_dist.
+    epsilon: small offset to ensure non-negative/zero distance.
+  Returns:
+    cond: Returns an op that evaluates to true if, at the end of an episode,
+    the distance between state and reset_state is greater than max_dist.
+  """
+  assert steps_per_episode is not None
+  assert reset_state is not None
+  del agent, action, transition_type, num_episodes  # `state` is used below.
+  dist = tf.sqrt(
+      tf.reduce_sum(tf.squared_difference(state, reset_state)) + epsilon)
+  cond = tf.logical_and(
+      tf.greater(dist, tf.constant(max_dist)),
+      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
+  return cond
+
+
+@gin.configurable
+def q_too_small(agent,
+                state,
+                action,
+                transition_type,
+                environment_steps,
+                num_episodes,
+                q_min=0.5):
+  """True if q is too small.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+    q_min: Returns true if the qval is less than q_min
+  Returns:
+    cond: Returns an op that evaluates to true if qval is less than q_min.
+  """
+  del transition_type, environment_steps, num_episodes
+  # Replace the last state entry with 0 (tf.float does not exist).
+  state_for_reset_agent = tf.concat(
+      [state[:-1], tf.constant([0], dtype=tf.float32)], 0)
+  qval = agent.BASE_AGENT_CLASS.critic_net(
+      tf.expand_dims(state_for_reset_agent, 0), tf.expand_dims(action, 0))[0, :]
+  cond = tf.greater(tf.constant(q_min), qval)
+  return cond
+
+
+@gin.configurable
+def true_fn(agent, state, action, transition_type, environment_steps,
+            num_episodes):
+  """Returns an op that evaluates to true.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+  Returns:
+    cond: op that always evaluates to True.
+  """
+  del agent, state, action, transition_type, environment_steps, num_episodes
+  cond = tf.constant(True, dtype=tf.bool)
+  return cond
+
+
+@gin.configurable
+def false_fn(agent, state, action, transition_type, environment_steps,
+             num_episodes):
+  """Returns an op that evaluates to false.
+
+  Args:
+    agent: RL agent.
+    state: A [num_state_dims] tensor representing a state.
+    action: Action performed.
+    transition_type: Type of transition after action
+    environment_steps: Number of steps performed by environment.
+    num_episodes: Number of episodes.
+  Returns:
+    cond: op that always evaluates to False.
+  """
+  del agent, state, action, transition_type, environment_steps, num_episodes
+  cond = tf.constant(False, dtype=tf.bool)
+  return cond
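
The step and episode conditions in `cond_fn.py` reduce to modular arithmetic on the counters. The sketch below restates `every_n_steps` and `every_n_episodes` in plain Python with booleans instead of TensorFlow ops. It is an illustration under assumptions, not the code from this commit: the unused arguments are dropped, and the `height` parameter stands in for `state[2]`, which the variable name `ant_fell` suggests is the torso height.

```python
def every_n_steps(environment_steps, n=150):
  """True when environment_steps is a multiple of n."""
  return environment_steps % n == 0


def every_n_episodes(height, environment_steps, num_episodes, n,
                     steps_per_episode):
  """True at an episode boundary if the ant fell or an nth episode ended."""
  ant_fell = height < 0.2 or height > 1.0
  at_episode_boundary = environment_steps % steps_per_episode == 0
  nth_episode = (num_episodes + 1) % n == 0
  return (ant_fell or nth_episode) and at_episode_boundary


# Hypothetical usage with 500-step episodes and a reset every 2 episodes.
reset_now = every_n_episodes(0.5, 500, 1, n=2, steps_per_episode=500)
```

Adding one to `num_episodes` keeps the condition from firing on the very first episode, since the counter starts at 0.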
--- a/research/efficient-hrl/configs/base_uvf.gin
+++ b/research/efficient-hrl/configs/base_uvf.gin
+#-*-Python-*-
+import gin.tf.external_configurables
+
+create_maze_env.top_down_view = %IMAGES
+## Create the agent
+AGENT_CLASS = @UvfAgent
+UvfAgent.tf_context = %CONTEXT
+UvfAgent.actor_net = @agent/ddpg_actor_net
+UvfAgent.critic_net = @agent/ddpg_critic_net
+UvfAgent.dqda_clipping = 0.0
+UvfAgent.td_errors_loss = @tf.losses.huber_loss
+UvfAgent.target_q_clipping = %TARGET_Q_CLIPPING
+
+# Create meta agent
+META_CLASS = @MetaAgent
+MetaAgent.tf_context = %META_CONTEXT
+MetaAgent.sub_context = %CONTEXT
+MetaAgent.actor_net = @meta/ddpg_actor_net
+MetaAgent.critic_net = @meta/ddpg_critic_net
+MetaAgent.dqda_clipping = 0.0
+MetaAgent.td_errors_loss = @tf.losses.huber_loss
+MetaAgent.target_q_clipping = %TARGET_Q_CLIPPING
+
+# Create state preprocess
+STATE_PREPROCESS_CLASS = @StatePreprocess
+StatePreprocess.ndims = %SUBGOAL_DIM
+state_preprocess_net.states_hidden_layers = (100, 100)
+state_preprocess_net.num_output_dims = %SUBGOAL_DIM
+state_preprocess_net.images = %IMAGES
+action_embed_net.num_output_dims = %SUBGOAL_DIM
+INVERSE_DYNAMICS_CLASS = @InverseDynamics
+
+# actor_net
+ACTOR_HIDDEN_SIZE_1 = 300
+ACTOR_HIDDEN_SIZE_2 = 300
+agent/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
+agent/ddpg_actor_net.activation_fn = @tf.nn.relu
+agent/ddpg_actor_net.zero_obs = %ZERO_OBS
+agent/ddpg_actor_net.images = %IMAGES
+meta/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
+meta/ddpg_actor_net.activation_fn = @tf.nn.relu
+meta/ddpg_actor_net.zero_obs = False
+meta/ddpg_actor_net.images = %IMAGES
+# critic_net
+CRITIC_HIDDEN_SIZE_1 = 300
+CRITIC_HIDDEN_SIZE_2 = 300
+agent/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
+agent/ddpg_critic_net.actions_hidden_layers = None
+agent/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
+agent/ddpg_critic_net.weight_decay = 0.0
+agent/ddpg_critic_net.activation_fn = @tf.nn.relu
+agent/ddpg_critic_net.zero_obs = %ZERO_OBS
+agent/ddpg_critic_net.images = %IMAGES
+meta/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
+meta/ddpg_critic_net.actions_hidden_layers = None
+meta/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
+meta/ddpg_critic_net.weight_decay = 0.0
+meta/ddpg_critic_net.activation_fn = @tf.nn.relu
+meta/ddpg_critic_net.zero_obs = False
+meta/ddpg_critic_net.images = %IMAGES
+
+tf.losses.huber_loss.delta = 1.0
+# Sample action
+uvf_add_noise_fn.stddev = 1.0
+meta_add_noise_fn.stddev = %META_EXPLORE_NOISE
+# Update targets
+ddpg_update_targets.tau = 0.001
+td3_update_targets.tau = 0.005
--- a/research/efficient-hrl/configs/eval_uvf.gin
+++ b/research/efficient-hrl/configs/eval_uvf.gin
+#-*-Python-*-
+# Config eval
+evaluate.environment = @create_maze_env()
+evaluate.agent_class = %AGENT_CLASS
+evaluate.meta_agent_class = %META_CLASS
+evaluate.state_preprocess_class = %STATE_PREPROCESS_CLASS
+evaluate.num_episodes_eval = 50
+evaluate.num_episodes_videos = 1
+evaluate.gamma = 1.0
+evaluate.eval_interval_secs = 1
+evaluate.generate_videos = False
+evaluate.generate_summaries = True
+evaluate.eval_modes = %EVAL_MODES
+evaluate.max_steps_per_episode = %RESET_EPISODE_PERIOD
--- a/research/efficient-hrl/configs/train_uvf.gin
+++ b/research/efficient-hrl/configs/train_uvf.gin
+#-*-Python-*-
+# Create replay_buffer
+agent/CircularBuffer.buffer_size = 200000
+meta/CircularBuffer.buffer_size = 200000
+agent/CircularBuffer.scope = "agent"
+meta/CircularBuffer.scope = "meta"
+
+# Config train
+train_uvf.environment = @create_maze_env()
+train_uvf.agent_class = %AGENT_CLASS
+train_uvf.meta_agent_class = %META_CLASS
+train_uvf.state_preprocess_class = %STATE_PREPROCESS_CLASS
+train_uvf.inverse_dynamics_class = %INVERSE_DYNAMICS_CLASS
+train_uvf.replay_buffer = @agent/CircularBuffer()
+train_uvf.meta_replay_buffer = @meta/CircularBuffer()
+train_uvf.critic_optimizer = @critic/AdamOptimizer()
+train_uvf.actor_optimizer = @actor/AdamOptimizer()
+train_uvf.meta_critic_optimizer = @meta_critic/AdamOptimizer()
+train_uvf.meta_actor_optimizer = @meta_actor/AdamOptimizer()
+train_uvf.repr_optimizer = @repr/AdamOptimizer()
+train_uvf.num_episodes_train = 25000
+train_uvf.batch_size = 100
+train_uvf.initial_episodes = 5
+train_uvf.gamma = 0.99
+train_uvf.meta_gamma = 0.99
+train_uvf.reward_scale_factor = 1.0
+train_uvf.target_update_period = 2
+train_uvf.num_updates_per_observation = 1
+train_uvf.num_collect_per_update = 1
+train_uvf.num_collect_per_meta_update = 10
+train_uvf.debug_summaries = False
+train_uvf.log_every_n_steps = 1000
+train_uvf.save_policy_every_n_steps = 100000
+
+# Config Optimizers
+critic/AdamOptimizer.learning_rate = 0.001
+critic/AdamOptimizer.beta1 = 0.9
+critic/AdamOptimizer.beta2 = 0.999
+actor/AdamOptimizer.learning_rate = 0.0001
+actor/AdamOptimizer.beta1 = 0.9
+actor/AdamOptimizer.beta2 = 0.999
+
+meta_critic/AdamOptimizer.learning_rate = 0.001
+meta_critic/AdamOptimizer.beta1 = 0.9
+meta_critic/AdamOptimizer.beta2 = 0.999
+meta_actor/AdamOptimizer.learning_rate = 0.0001
+meta_actor/AdamOptimizer.beta1 = 0.9
+meta_actor/AdamOptimizer.beta2 = 0.999
+
+repr/AdamOptimizer.learning_rate = 0.0001
+repr/AdamOptimizer.beta1 = 0.9
+repr/AdamOptimizer.beta2 = 0.999
--- a/research/efficient-hrl/context/__init__.py
+++ b/research/efficient-hrl/context/__init__.py
+
--- a/research/efficient-hrl/context/configs/ant_block.gin
+++ b/research/efficient-hrl/context/configs/ant_block.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntBlock"
+ZERO_OBS = False
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-4, -4), (20, 20))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval1", "eval2", "eval3"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [2]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval1": [@eval1/ConstantSampler],
+    "eval2": [@eval2/ConstantSampler],
+    "eval3": [@eval3/ConstantSampler],
+}
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [3, 4]
+task/negative_distance.relative_context = False
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+MetaAgent.k = %SUBGOAL_DIM
+
+eval1/ConstantSampler.value = [16, 0]
+eval2/ConstantSampler.value = [16, 16]
+eval3/ConstantSampler.value = [0, 16]
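
Note on the reset bindings in ant_block.gin: `RESET_EPISODE_PERIOD = 500` and `RESET_ENV_PERIOD = 1` feed `every_n_steps` and `every_n_episodes`, so an episode ends every 500 steps and the environment is reset after every episode. The sketch below shows that schedule in plain Python. The function bodies and the 1-based `step` argument are assumptions made for illustration; the condition functions in this PR presumably operate on TensorFlow tensors and may have different signatures.

```python
# Illustrative only: plain-Python stand-ins for the reset conditions
# referenced by the config. Signatures here are hypothetical.

def every_n_steps(step: int, n: int) -> bool:
    """True when `step` (count of steps taken so far) completes a block of n steps."""
    return step % n == 0


def every_n_episodes(step: int, n: int, steps_per_episode: int) -> bool:
    """True when `step` completes a block of n whole episodes."""
    return step % (n * steps_per_episode) == 0


RESET_EPISODE_PERIOD = 500  # value from the config
RESET_ENV_PERIOD = 1        # value from the config

episode_ends = [s for s in range(1, 1501)
                if every_n_steps(s, RESET_EPISODE_PERIOD)]
env_resets = [s for s in range(1, 1501)
              if every_n_episodes(s, RESET_ENV_PERIOD, RESET_EPISODE_PERIOD)]

print(episode_ends)  # [500, 1000, 1500]
print(env_resets)    # [500, 1000, 1500]
```

With `RESET_ENV_PERIOD = 1` the two schedules coincide; a larger value would space the manual resets further apart while the episode boundaries stay at every 500 steps.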
--- a/research/efficient-hrl/context/configs/ant_block_maze.gin
+++ b/research/efficient-hrl/context/configs/ant_block_maze.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntBlockMaze"
+ZERO_OBS = False
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-4, -4), (12, 20))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval1", "eval2", "eval3"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [2]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval1": [@eval1/ConstantSampler],
+    "eval2": [@eval2/ConstantSampler],
+    "eval3": [@eval3/ConstantSampler],
+}
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [3, 4]
+task/negative_distance.relative_context = False
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+MetaAgent.k = %SUBGOAL_DIM
+
+eval1/ConstantSampler.value = [8, 0]
+eval2/ConstantSampler.value = [8, 16]
+eval3/ConstantSampler.value = [0, 16]
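
Note on the reward bindings in ant_block_maze.gin: `task/negative_distance` is configured with `state_indices = [3, 4]`, `relative_context = False` and `offset = 0.0`. A minimal sketch of what such a reward could compute follows, assuming it is the negative L2 distance between the selected observation entries and the goal, with `offset` added to the result. The signature is hypothetical, the `diff` option is not modelled, and the function in this PR may treat these parameters differently.

```python
import math

# Illustrative only: a hypothetical plain-Python reading of the
# `negative_distance` bindings. Not the implementation from this PR.

def negative_distance(state, next_state, context, state_indices,
                      relative_context=False, offset=0.0):
    # Pick out the observation entries that the goal refers to.
    achieved = [next_state[i] for i in state_indices]
    if relative_context:
        # Assumption: a relative goal is an offset from the current state.
        goal = [state[i] + c for i, c in zip(state_indices, context)]
    else:
        goal = list(context)
    dist = math.sqrt(sum((a - g) ** 2 for a, g in zip(achieved, goal)))
    return offset - dist


state = [0.0, 0.0, 0.0, 10.0, 0.0]
next_state = [0.0, 0.0, 0.0, 13.0, 4.0]

# Absolute goal (relative_context = False), as in this config.
reward = negative_distance(state, next_state, [16.0, 0.0], [3, 4])
print(reward)  # -5.0

# Relative goal (relative_context = True), as in the *_img configs.
rel_reward = negative_distance(state, next_state, [3.0, 4.0], [3, 4],
                               relative_context=True)
print(rel_reward)  # 0.0
```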
--- a/research/efficient-hrl/context/configs/ant_fall_multi.gin
+++ b/research/efficient-hrl/context/configs/ant_fall_multi.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntFall"
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-4, -4, 0), (12, 28, 5))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval1"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [3]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval1": [@eval1/ConstantSampler],
+}
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [0, 1, 2]
+task/negative_distance.relative_context = False
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+MetaAgent.k = %SUBGOAL_DIM
+
+eval1/ConstantSampler.value = [0, 27, 4.5]
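
Note on the sampler bindings in ant_fall_multi.gin: the meta context uses `train/RandomSampler` for the "train" and "explore" modes and `eval1/ConstantSampler` for evaluation. The sketch below assumes that `RandomSampler` draws each goal coordinate uniformly from `meta_context_range` and that `ConstantSampler` always returns its configured `value`. Both classes are hypothetical stand-ins; the samplers in this PR presumably return batched TensorFlow tensors.

```python
import random

# Illustrative only: hypothetical stand-ins for the samplers named in the config.

class RandomSampler:
    """Draws each coordinate uniformly between the low and high bounds."""

    def __init__(self, context_range, seed=None):
        self.low, self.high = context_range
        self.rng = random.Random(seed)

    def __call__(self):
        return [self.rng.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]


class ConstantSampler:
    """Always returns the same goal."""

    def __init__(self, value):
        self.value = list(value)

    def __call__(self):
        return list(self.value)


meta_context_range = ((-4, -4, 0), (12, 28, 5))  # value from the config

samplers = {
    "train": RandomSampler(meta_context_range, seed=0),
    "explore": RandomSampler(meta_context_range, seed=1),
    "eval1": ConstantSampler([0, 27, 4.5]),
}

train_goal = samplers["train"]()
eval_goal = samplers["eval1"]()
print(len(train_goal))  # 3
print(eval_goal)        # [0, 27, 4.5]
```

Under this reading, ant_fall_single.gin differs only in binding "train" and "explore" to the constant sampler, so training always targets the evaluation goal.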
--- a/research/efficient-hrl/context/configs/ant_fall_multi_img.gin
+++ b/research/efficient-hrl/context/configs/ant_fall_multi_img.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntFall"
+IMAGES = True
+
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-4, -4, 0), (12, 28, 5))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval1"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [3]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval1": [@eval1/ConstantSampler],
+}
+meta/Context.context_transition_fn = @task/relative_context_transition_fn
+meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [0, 1, 2]
+task/negative_distance.relative_context = True
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+task/relative_context_transition_fn.k = 3
+task/relative_context_multi_transition_fn.k = 3
+MetaAgent.k = %SUBGOAL_DIM
+
+eval1/ConstantSampler.value = [0, 27, 0]
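
Note on `relative_context_transition_fn` in ant_fall_multi_img.gin: the HIRO paper defines the goal transition for relative goals as h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}, so the absolute target stays fixed while the agent moves. The sketch below applies that rule to the first `k` state dimensions. The function name and signature are hypothetical, and whether the code in this PR follows exactly this form is an assumption.

```python
# Illustrative only: the HIRO relative-goal update on the first k dimensions.
# The name and signature are hypothetical.

def relative_context_transition(state, context, next_state, k):
    """Return the goal re-expressed relative to `next_state`."""
    return [state[i] + context[i] - next_state[i] for i in range(k)]


k = 3  # matches task/relative_context_transition_fn.k in this config
state = [0.0, 0.0, 0.5]
context = [1.0, 2.0, 0.0]      # goal, relative to `state`
next_state = [0.5, 0.5, 0.5]

next_context = relative_context_transition(state, context, next_state, k)
print(next_context)  # [0.5, 1.5, 0.0]

# The absolute target is unchanged by the update.
target_before = [s + g for s, g in zip(state, context)]
target_after = [s + g for s, g in zip(next_state, next_context)]
print(target_before == target_after)  # True
```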
--- a/research/efficient-hrl/context/configs/ant_fall_single.gin
+++ b/research/efficient-hrl/context/configs/ant_fall_single.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntFall"
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-4, -4, 0), (12, 28, 5))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval1"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [3]
+meta/Context.samplers = {
+    "train": [@eval1/ConstantSampler],
+    "explore": [@eval1/ConstantSampler],
+    "eval1": [@eval1/ConstantSampler],
+}
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [0, 1, 2]
+task/negative_distance.relative_context = False
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+MetaAgent.k = %SUBGOAL_DIM
+
+eval1/ConstantSampler.value = [0, 27, 4.5]
--- a/research/efficient-hrl/context/configs/ant_maze.gin
+++ b/research/efficient-hrl/context/configs/ant_maze.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntMaze"
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-4, -4), (20, 20))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval1", "eval2", "eval3"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [2]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval1": [@eval1/ConstantSampler],
+    "eval2": [@eval2/ConstantSampler],
+    "eval3": [@eval3/ConstantSampler],
+}
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [0, 1]
+task/negative_distance.relative_context = False
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+MetaAgent.k = %SUBGOAL_DIM
+
+eval1/ConstantSampler.value = [16, 0]
+eval2/ConstantSampler.value = [16, 16]
+eval3/ConstantSampler.value = [0, 16]
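
Note on `agent/Context.meta_action_every_n = 10` in ant_maze.gin: in HIRO the higher-level policy proposes a new subgoal every fixed number of environment steps. Assuming that is what this binding controls, and combining it with `RESET_EPISODE_PERIOD = 500`, each episode contains 50 subgoals. The helper below is hypothetical and only counts the steps at which a new subgoal would be emitted.

```python
# Illustrative only: which steps of an episode would start a new subgoal,
# assuming the meta policy acts at step 0 and every n steps after that.

def subgoal_steps(steps_per_episode: int, meta_action_every_n: int):
    return [t for t in range(steps_per_episode)
            if t % meta_action_every_n == 0]


steps = subgoal_steps(steps_per_episode=500, meta_action_every_n=10)
print(len(steps), steps[:3], steps[-1])  # 50 [0, 10, 20] 490
```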
--- a/research/efficient-hrl/context/configs/ant_maze_img.gin
+++ b/research/efficient-hrl/context/configs/ant_maze_img.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntMaze"
+IMAGES = True
+
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-4, -4), (20, 20))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval1", "eval2", "eval3"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [2]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval1": [@eval1/ConstantSampler],
+    "eval2": [@eval2/ConstantSampler],
+    "eval3": [@eval3/ConstantSampler],
+}
+meta/Context.context_transition_fn = @task/relative_context_transition_fn
+meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [0, 1]
+task/negative_distance.relative_context = True
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+task/relative_context_transition_fn.k = 2
+task/relative_context_multi_transition_fn.k = 2
+MetaAgent.k = %SUBGOAL_DIM
+
+eval1/ConstantSampler.value = [16, 0]
+eval2/ConstantSampler.value = [16, 16]
+eval3/ConstantSampler.value = [0, 16]
--- a/research/efficient-hrl/context/configs/ant_push_multi.gin
+++ b/research/efficient-hrl/context/configs/ant_push_multi.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntPush"
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-16, -4), (16, 20))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval2"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [2]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval2": [@eval2/ConstantSampler],
+}
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [0, 1]
+task/negative_distance.relative_context = False
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+MetaAgent.k = %SUBGOAL_DIM
+
+eval2/ConstantSampler.value = [0, 19]
--- a/research/efficient-hrl/context/configs/ant_push_multi_img.gin
+++ b/research/efficient-hrl/context/configs/ant_push_multi_img.gin
+#-*-Python-*-
+create_maze_env.env_name = "AntPush"
+IMAGES = True
+
+context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
+meta_context_range = ((-16, -4), (16, 20))
+
+RESET_EPISODE_PERIOD = 500
+RESET_ENV_PERIOD = 1
+# End episode every N steps
+UvfAgent.reset_episode_cond_fn = @every_n_steps
+every_n_steps.n = %RESET_EPISODE_PERIOD
+train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
+# Do a manual reset every N episodes
+UvfAgent.reset_env_cond_fn = @every_n_episodes
+every_n_episodes.n = %RESET_ENV_PERIOD
+every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
+
+## Config defaults
+EVAL_MODES = ["eval2"]
+
+## Config agent
+CONTEXT = @agent/Context
+META_CONTEXT = @meta/Context
+
+## Config agent context
+agent/Context.context_ranges = [%context_range]
+agent/Context.context_shapes = [%SUBGOAL_DIM]
+agent/Context.meta_action_every_n = 10
+agent/Context.samplers = {
+    "train": [@train/DirectionSampler],
+    "explore": [@train/DirectionSampler],
+}
+
+agent/Context.context_transition_fn = @relative_context_transition_fn
+agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
+
+agent/Context.reward_fn = @uvf/negative_distance
+
+## Config meta context
+meta/Context.context_ranges = [%meta_context_range]
+meta/Context.context_shapes = [2]
+meta/Context.samplers = {
+    "train": [@train/RandomSampler],
+    "explore": [@train/RandomSampler],
+    "eval2": [@eval2/ConstantSampler],
+}
+meta/Context.context_transition_fn = @task/relative_context_transition_fn
+meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
+meta/Context.reward_fn = @task/negative_distance
+
+## Config rewards
+task/negative_distance.state_indices = [0, 1]
+task/negative_distance.relative_context = True
+task/negative_distance.diff = False
+task/negative_distance.offset = 0.0
+
+## Config samplers
+train/RandomSampler.context_range = %meta_context_range
+train/DirectionSampler.context_range = %context_range
+train/DirectionSampler.k = %SUBGOAL_DIM
+relative_context_transition_fn.k = %SUBGOAL_DIM
+relative_context_multi_transition_fn.k = %SUBGOAL_DIM
+task/relative_context_transition_fn.k = 2
+task/relative_context_multi_transition_fn.k = 2
+MetaAgent.k = %SUBGOAL_DIM
+
+eval2/ConstantSampler.value = [0, 19]