ModelZoo / ResNet50_tensorflow / Commits

Commit 052361de, authored Dec 05, 2018 by ofirnachum

add training code

parent 9b969ca5

Changes: 51 files in this commit; this page shows 20 changed files with 2968 additions and 8 deletions (+2968, -8).
research/efficient-hrl/README.md  +42 -8
research/efficient-hrl/agent.py  +774 -0
research/efficient-hrl/agents/__init__.py  +1 -0
research/efficient-hrl/agents/circular_buffer.py  +289 -0
research/efficient-hrl/agents/ddpg_agent.py  +739 -0
research/efficient-hrl/agents/ddpg_networks.py  +150 -0
research/efficient-hrl/cond_fn.py  +244 -0
research/efficient-hrl/configs/base_uvf.gin  +68 -0
research/efficient-hrl/configs/eval_uvf.gin  +14 -0
research/efficient-hrl/configs/train_uvf.gin  +52 -0
research/efficient-hrl/context/__init__.py  +1 -0
research/efficient-hrl/context/configs/ant_block.gin  +67 -0
research/efficient-hrl/context/configs/ant_block_maze.gin  +67 -0
research/efficient-hrl/context/configs/ant_fall_multi.gin  +62 -0
research/efficient-hrl/context/configs/ant_fall_multi_img.gin  +68 -0
research/efficient-hrl/context/configs/ant_fall_single.gin  +62 -0
research/efficient-hrl/context/configs/ant_maze.gin  +66 -0
research/efficient-hrl/context/configs/ant_maze_img.gin  +72 -0
research/efficient-hrl/context/configs/ant_push_multi.gin  +62 -0
research/efficient-hrl/context/configs/ant_push_multi_img.gin  +68 -0
research/efficient-hrl/README.md
Code for performing Hierarchical RL based on the following publications:

"Data-Efficient Hierarchical Reinforcement Learning" by
Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
(https://arxiv.org/abs/1805.08296). This library currently includes three of
the environments used: Ant Maze, Ant Push, and Ant Fall.

"Near-Optimal Representation Learning for Hierarchical Reinforcement Learning"
by Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
(https://arxiv.org/abs/1810.01257).
Requirements:

* TensorFlow (see http://www.tensorflow.org for how to install/upgrade)
* Gin Config (see https://github.com/google/gin-config)
* Tensorflow Agents (see https://github.com/tensorflow/agents)
* OpenAI Gym (see http://gym.openai.com/docs, be sure to install MuJoCo as well)
* NumPy (see http://www.numpy.org/)
Quick Start:

Run a training job based on the original HIRO paper on Ant Maze:

```
python scripts/local_train.py test1 hiro_orig ant_maze base_uvf suite
```

Run a continuous evaluation job for that experiment:

```
python scripts/local_eval.py test1 hiro_orig ant_maze base_uvf suite
```
To run the same experiment with online representation learning (the
"Near-Optimal" paper), change `hiro_orig` to `hiro_repr`.
You can also run with `hiro_xy` to run the same experiment with HIRO on only
the xy coordinates of the agent.

To run on other environments, change `ant_maze` to something else; e.g.,
`ant_push_multi`, `ant_fall_multi`, etc. See `context/configs/*` for other
options.
Basic Code Guide:

The code for training resides in train.py. The code trains a lower-level policy
(a UVF agent in the code) and a higher-level policy (a MetaAgent in the code)
concurrently. The higher-level policy communicates goals to the lower-level
policy. In the code, this is called a context. Not only does the lower-level
policy act with respect to a context (a higher-level specified goal), but the
higher-level policy also acts with respect to an environment-specified context
(corresponding to the navigation target location associated with the task).
Therefore, in `context/configs/*` you will find both specifications for task
setup as well as goal configurations. Most remaining hyperparameters used for
training/evaluation may be found in `configs/*`.
NOTE: Not all the code corresponding to the "Near-Optimal" paper is included.
Namely, changes to low-level policy training proposed in the paper (discounting
and auxiliary rewards) are not implemented here. Performance should not change
significantly.
Maintained by Ofir Nachum (ofirnachum).
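
For reference, the two-level interaction described in the Basic Code Guide can be pictured with a short, purely illustrative sketch. This is not code from the repository; `env`, `meta_agent`, `agent`, and `max_steps` are hypothetical stand-ins, and the goal period mirrors the `meta_action_every_n = 10` setting in the context configs.

```
# Illustrative pseudocode only -- not the repository's train.py.
# `env`, `meta_agent`, and `agent` are hypothetical stand-ins.
meta_action_every_n = 10   # how often the higher-level policy emits a new goal
max_steps = 500            # mirrors RESET_EPISODE_PERIOD in the configs

state = env.reset()
goal = None
for step in range(max_steps):
  if step % meta_action_every_n == 0:
    # Higher-level policy picks a new goal (the "context" for the UVF agent),
    # itself acting with respect to the environment-specified task context.
    goal = meta_agent.action(state)
  # Lower-level policy acts with respect to the current goal.
  action = agent.action(state, goal)
  state, reward, done, _ = env.step(action)
  if done:
    state = env.reset()
```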
research/efficient-hrl/agent.py
0 → 100644
This diff is collapsed.
research/efficient-hrl/agents/__init__.py
0 → 100644
research/efficient-hrl/agents/circular_buffer.py
0 → 100644
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A circular buffer where each element is a list of tensors.
Each element of the buffer is a list of tensors. An example use case is a replay
buffer in reinforcement learning, where each element is a list of tensors
representing the state, action, reward etc.
New elements are added sequentially, and once the buffer is full, we
start overwriting them in a circular fashion. Reading does not remove any
elements, only adding new elements does.
"""
import collections
import numpy as np
import tensorflow as tf

import gin.tf


@gin.configurable
class CircularBuffer(object):
  """A circular buffer where each element is a list of tensors."""

  def __init__(self, buffer_size=1000, scope='replay_buffer'):
    """Circular buffer of list of tensors.

    Args:
      buffer_size: (integer) maximum number of tensor lists the buffer can hold.
      scope: (string) variable scope for creating the variables.
    """
    self._buffer_size = np.int64(buffer_size)
    self._scope = scope
    self._tensors = collections.OrderedDict()
    with tf.variable_scope(self._scope):
      self._num_adds = tf.Variable(0, dtype=tf.int64, name='num_adds')
      self._num_adds_cs = tf.contrib.framework.CriticalSection(name='num_adds')

  @property
  def buffer_size(self):
    return self._buffer_size

  @property
  def scope(self):
    return self._scope

  @property
  def num_adds(self):
    return self._num_adds

  def _create_variables(self, tensors):
    with tf.variable_scope(self._scope):
      for name in tensors.keys():
        tensor = tensors[name]
        self._tensors[name] = tf.get_variable(
            name='BufferVariable_' + name,
            shape=[self._buffer_size] + tensor.get_shape().as_list(),
            dtype=tensor.dtype,
            trainable=False)

  def _validate(self, tensors):
    """Validate shapes of tensors."""
    if len(tensors) != len(self._tensors):
      raise ValueError('Expected tensors to have %d elements. Received %d '
                       'instead.' % (len(self._tensors), len(tensors)))
    if self._tensors.keys() != tensors.keys():
      raise ValueError('The keys of tensors should be the always the same.'
                       'Received %s instead %s.' %
                       (tensors.keys(), self._tensors.keys()))
    for name, tensor in tensors.items():
      if tensor.get_shape().as_list() != self._tensors[
          name].get_shape().as_list()[1:]:
        raise ValueError('Tensor %s has incorrect shape.' % name)
      if not tensor.dtype.is_compatible_with(self._tensors[name].dtype):
        raise ValueError(
            'Tensor %s has incorrect data type. Expected %s, received %s' %
            (name, self._tensors[name].read_value().dtype, tensor.dtype))

  def add(self, tensors):
    """Adds an element (list/tuple/dict of tensors) to the buffer.

    Args:
      tensors: (list/tuple/dict of tensors) to be added to the buffer.

    Returns:
      An add operation that adds the input `tensors` to the buffer. Similar to
        an enqueue_op.
    Raises:
      ValueError: If the shapes and data types of input `tensors' are not the
        same across calls to the add function.
    """
    return self.maybe_add(tensors, True)

  def maybe_add(self, tensors, condition):
    """Adds an element (tensors) to the buffer based on the condition..

    Args:
      tensors: (list/tuple of tensors) to be added to the buffer.
      condition: A boolean Tensor controlling whether the tensors would be added
        to the buffer or not.

    Returns:
      An add operation that adds the input `tensors` to the buffer. Similar to
        an maybe_enqueue_op.
    Raises:
      ValueError: If the shapes and data types of input `tensors' are not the
        same across calls to the add function.
    """
    if not isinstance(tensors, dict):
      names = [str(i) for i in range(len(tensors))]
      tensors = collections.OrderedDict(zip(names, tensors))
    if not isinstance(tensors, collections.OrderedDict):
      tensors = collections.OrderedDict(
          sorted(tensors.items(), key=lambda t: t[0]))
    if not self._tensors:
      self._create_variables(tensors)
    else:
      self._validate(tensors)

    #@tf.critical_section(self._position_mutex)
    def _increment_num_adds():
      # Adding 0 to the num_adds variable is a trick to read the value of the
      # variable and return a read-only tensor. Doing this in a critical
      # section allows us to capture a snapshot of the variable that will
      # not be affected by other threads updating num_adds.
      return self._num_adds.assign_add(1) + 0

    def _add():
      num_adds_inc = self._num_adds_cs.execute(_increment_num_adds)
      current_pos = tf.mod(num_adds_inc - 1, self._buffer_size)
      update_ops = []
      for name in self._tensors.keys():
        update_ops.append(
            tf.scatter_update(self._tensors[name], current_pos, tensors[name]))
      return tf.group(*update_ops)

    return tf.contrib.framework.smart_cond(condition, _add, tf.no_op)

  def get_random_batch(self, batch_size, keys=None, num_steps=1):
    """Samples a batch of tensors from the buffer with replacement.

    Args:
      batch_size: (integer) number of elements to sample.
      keys: List of keys of tensors to retrieve. If None retrieve all.
      num_steps: (integer) length of trajectories to return. If > 1 will return
        a list of lists, where each internal list represents a trajectory of
        length num_steps.

    Returns:
      A list of tensors, where each element in the list is a batch sampled from
        one of the tensors in the buffer.
    Raises:
      ValueError: If get_random_batch is called before calling the add function.
      tf.errors.InvalidArgumentError: If this operation is executed before any
        items are added to the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before get_random_batch.')
    if keys is None:
      keys = self._tensors.keys()

    latest_start_index = self.get_num_adds() - num_steps + 1
    empty_buffer_assert = tf.Assert(
        tf.greater(latest_start_index, 0),
        ['Not enough elements have been added to the buffer.'])
    with tf.control_dependencies([empty_buffer_assert]):
      max_index = tf.minimum(self._buffer_size, latest_start_index)
      indices = tf.random_uniform(
          [batch_size],
          minval=0,
          maxval=max_index,
          dtype=tf.int64)
      if num_steps == 1:
        return self.gather(indices, keys)
      else:
        return self.gather_nstep(num_steps, indices, keys)

  def gather(self, indices, keys=None):
    """Returns elements at the specified indices from the buffer.

    Args:
      indices: (list of integers or rank 1 int Tensor) indices in the buffer to
        retrieve elements from.
      keys: List of keys of tensors to retrieve. If None retrieve all.

    Returns:
      A list of tensors, where each element in the list is obtained by indexing
        one of the tensors in the buffer.
    Raises:
      ValueError: If gather is called before calling the add function.
      tf.errors.InvalidArgumentError: If indices are bigger than the number of
        items in the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before calling gather.')
    if keys is None:
      keys = self._tensors.keys()
    with tf.name_scope('Gather'):
      index_bound_assert = tf.Assert(
          tf.less(
              tf.to_int64(tf.reduce_max(indices)),
              tf.minimum(self.get_num_adds(), self._buffer_size)),
          ['Index out of bounds.'])
      with tf.control_dependencies([index_bound_assert]):
        indices = tf.convert_to_tensor(indices)

      batch = []
      for key in keys:
        batch.append(tf.gather(self._tensors[key], indices, name=key))
      return batch

  def gather_nstep(self, num_steps, indices, keys=None):
    """Returns elements at the specified indices from the buffer.

    Args:
      num_steps: (integer) length of trajectories to return.
      indices: (list of rank num_steps int Tensor) indices in the buffer to
        retrieve elements from for multiple trajectories. Each Tensor in the
        list represents the indices for a trajectory.
      keys: List of keys of tensors to retrieve. If None retrieve all.

    Returns:
      A list of list-of-tensors, where each element in the list is obtained by
        indexing one of the tensors in the buffer.
    Raises:
      ValueError: If gather is called before calling the add function.
      tf.errors.InvalidArgumentError: If indices are bigger than the number of
        items in the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before calling gather.')
    if keys is None:
      keys = self._tensors.keys()
    with tf.name_scope('Gather'):
      index_bound_assert = tf.Assert(
          tf.less_equal(
              tf.to_int64(tf.reduce_max(indices)) + num_steps,
              self.get_num_adds()),
          ['Trajectory indices go out of bounds.'])
      with tf.control_dependencies([index_bound_assert]):
        indices = tf.map_fn(
            lambda x: tf.mod(tf.range(x, x + num_steps), self._buffer_size),
            indices,
            dtype=tf.int64)

      batch = []
      for key in keys:

        def SampleTrajectories(trajectory_indices, key=key,
                               num_steps=num_steps):
          trajectory_indices.set_shape([num_steps])
          return tf.gather(self._tensors[key], trajectory_indices, name=key)

        batch.append(tf.map_fn(SampleTrajectories, indices,
                               dtype=self._tensors[key].dtype))
      return batch

  def get_position(self):
    """Returns the position at which the last element was added.

    Returns:
      An int tensor representing the index at which the last element was added
        to the buffer or -1 if no elements were added.
    """
    return tf.cond(self.get_num_adds() < 1,
                   lambda: self.get_num_adds() - 1,
                   lambda: tf.mod(self.get_num_adds() - 1, self._buffer_size))

  def get_num_adds(self):
    """Returns the number of additions to the buffer.

    Returns:
      An int tensor representing the number of elements that were added.
    """
    def num_adds():
      return self._num_adds.value()

    return self._num_adds_cs.execute(num_adds)

  def get_num_tensors(self):
    """Returns the number of tensors (slots) in the buffer."""
    return len(self._tensors)
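
The class above is the replay buffer used by the training code. As a quick, hedged illustration of its API (not part of this commit; the transition keys and shapes below are made up for the example):

```
import collections
import tensorflow as tf

from agents.circular_buffer import CircularBuffer  # path within research/efficient-hrl

# Build a small buffer and the ops to write to and sample from it
# (TF1-style graph mode, matching the code above).
buffer = CircularBuffer(buffer_size=1000, scope='example_replay_buffer')
transition = collections.OrderedDict([
    ('state', tf.placeholder(tf.float32, [8], name='state')),
    ('action', tf.placeholder(tf.float32, [2], name='action')),
    ('reward', tf.placeholder(tf.float32, [], name='reward')),
])
add_op = buffer.add(transition)        # enqueue-like op; creates the variables
batch = buffer.get_random_batch(32)    # list of [32, ...] tensors, one per key
num_adds = buffer.get_num_adds()       # int64 tensor: total elements ever added
```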
research/efficient-hrl/agents/ddpg_agent.py
0 → 100644
This diff is collapsed.
research/efficient-hrl/agents/ddpg_networks.py
0 → 100644
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Sample actor(policy) and critic(q) networks to use with DDPG/NAF agents.
The DDPG networks are defined in "Section 7: Experiment Details" of
"Continuous control with deep reinforcement learning" - Lilicrap et al.
https://arxiv.org/abs/1509.02971
The NAF critic network is based on "Section 4" of "Continuous deep Q-learning
with model-based acceleration" - Gu et al. https://arxiv.org/pdf/1603.00748.
"""
import tensorflow as tf
slim = tf.contrib.slim
import gin.tf


@gin.configurable('ddpg_critic_net')
def critic_net(states, actions,
               for_critic_loss=False,
               num_reward_dims=1,
               states_hidden_layers=(400,),
               actions_hidden_layers=None,
               joint_hidden_layers=(300,),
               weight_decay=0.0001,
               normalizer_fn=None,
               activation_fn=tf.nn.relu,
               zero_obs=False,
               images=False):
  """Creates a critic that returns q values for the given states and actions.

  Args:
    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
      representing a batch of states.
    actions: (castable to tf.float32) a [batch_size, num_action_dims] tensor
      representing a batch of actions.
    num_reward_dims: Number of reward dimensions.
    states_hidden_layers: tuple of hidden layers units for states.
    actions_hidden_layers: tuple of hidden layers units for actions.
    joint_hidden_layers: tuple of hidden layers units after joining states
      and actions using tf.concat().
    weight_decay: Weight decay for l2 weights regularizer.
    normalizer_fn: Normalizer function, i.e. slim.layer_norm,
    activation_fn: Activation function, i.e. tf.nn.relu, slim.leaky_relu, ...
  Returns:
    A tf.float32 [batch_size] tensor of q values, or a tf.float32
      [batch_size, num_reward_dims] tensor of vector q values if
      num_reward_dims > 1.
  """
  with slim.arg_scope(
      [slim.fully_connected],
      activation_fn=activation_fn,
      normalizer_fn=normalizer_fn,
      weights_regularizer=slim.l2_regularizer(weight_decay),
      weights_initializer=slim.variance_scaling_initializer(
          factor=1.0/3.0, mode='FAN_IN', uniform=True)):

    orig_states = tf.to_float(states)
    #states = tf.to_float(states)
    states = tf.concat([tf.to_float(states), tf.to_float(actions)], -1)  #TD3
    if images or zero_obs:
      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))  #LALA
    actions = tf.to_float(actions)
    if states_hidden_layers:
      states = slim.stack(states, slim.fully_connected, states_hidden_layers,
                          scope='states')
    if actions_hidden_layers:
      actions = slim.stack(actions, slim.fully_connected,
                           actions_hidden_layers, scope='actions')
    joint = tf.concat([states, actions], 1)
    if joint_hidden_layers:
      joint = slim.stack(joint, slim.fully_connected, joint_hidden_layers,
                         scope='joint')
    with slim.arg_scope([slim.fully_connected],
                        weights_regularizer=None,
                        weights_initializer=tf.random_uniform_initializer(
                            minval=-0.003, maxval=0.003)):
      value = slim.fully_connected(joint, num_reward_dims,
                                   activation_fn=None,
                                   normalizer_fn=None,
                                   scope='q_value')
    if num_reward_dims == 1:
      value = tf.reshape(value, [-1])
    if not for_critic_loss and num_reward_dims > 1:
      value = tf.reduce_sum(
          value * tf.abs(orig_states[:, -num_reward_dims:]), -1)
  return value


@gin.configurable('ddpg_actor_net')
def actor_net(states, action_spec,
              hidden_layers=(400, 300),
              normalizer_fn=None,
              activation_fn=tf.nn.relu,
              zero_obs=False,
              images=False):
  """Creates an actor that returns actions for the given states.

  Args:
    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
      representing a batch of states.
    action_spec: (BoundedTensorSpec) A tensor spec indicating the shape
      and range of actions.
    hidden_layers: tuple of hidden layers units.
    normalizer_fn: Normalizer function, i.e. slim.layer_norm,
    activation_fn: Activation function, i.e. tf.nn.relu, slim.leaky_relu, ...
  Returns:
    A tf.float32 [batch_size, num_action_dims] tensor of actions.
  """
  with slim.arg_scope(
      [slim.fully_connected],
      activation_fn=activation_fn,
      normalizer_fn=normalizer_fn,
      weights_initializer=slim.variance_scaling_initializer(
          factor=1.0/3.0, mode='FAN_IN', uniform=True)):

    states = tf.to_float(states)
    orig_states = states
    if images or zero_obs:
      # Zero-out x, y position. Hacky.
      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))
    if hidden_layers:
      states = slim.stack(states, slim.fully_connected, hidden_layers,
                          scope='states')
    with slim.arg_scope([slim.fully_connected],
                        weights_initializer=tf.random_uniform_initializer(
                            minval=-0.003, maxval=0.003)):
      actions = slim.fully_connected(states,
                                     action_spec.shape.num_elements(),
                                     scope='actions',
                                     normalizer_fn=None,
                                     activation_fn=tf.nn.tanh)
      action_means = (action_spec.maximum + action_spec.minimum) / 2.0
      action_magnitudes = (action_spec.maximum - action_spec.minimum) / 2.0
      actions = action_means + action_magnitudes * actions
  return actions
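
As a hedged usage sketch of the two networks above (not part of this commit): the `_ActionSpec` class below is a hypothetical stand-in for the BoundedTensorSpec that the agents library would normally supply, and the state/action dimensions are arbitrary.

```
import tensorflow as tf

from agents.ddpg_networks import actor_net, critic_net  # path within research/efficient-hrl


class _ActionSpec(object):
  """Minimal stand-in for a BoundedTensorSpec (illustration only)."""

  def __init__(self, num_dims, minimum, maximum):
    self.shape = tf.TensorShape([num_dims])
    self.minimum = minimum
    self.maximum = maximum


action_spec = _ActionSpec(num_dims=8, minimum=-1.0, maximum=1.0)
states = tf.placeholder(tf.float32, [None, 30], name='states')

with tf.variable_scope('actor'):
  actions = actor_net(states, action_spec)   # [batch_size, 8], scaled to [-1, 1]
with tf.variable_scope('critic'):
  q_values = critic_net(states, actions)     # [batch_size] q-value estimates
```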
research/efficient-hrl/cond_fn.py
0 → 100644
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines many boolean functions indicating when to step and reset.
"""
import tensorflow as tf
import gin.tf


@gin.configurable
def env_transition(agent, state, action, transition_type, environment_steps,
                   num_episodes):
  """True if the transition_type is TRANSITION or FINAL_TRANSITION.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: Returns an op that evaluates to true if the transition type is
      not RESTARTING
  """
  del agent, state, action, num_episodes, environment_steps
  cond = tf.logical_not(transition_type)
  return cond


@gin.configurable
def env_restart(agent, state, action, transition_type, environment_steps,
                num_episodes):
  """True if the transition_type is RESTARTING.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: Returns an op that evaluates to true if the transition type equals
      RESTARTING.
  """
  del agent, state, action, num_episodes, environment_steps
  cond = tf.identity(transition_type)
  return cond


@gin.configurable
def every_n_steps(agent,
                  state,
                  action,
                  transition_type,
                  environment_steps,
                  num_episodes,
                  n=150):
  """True once every n steps.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    n: Return true once every n steps.
  Returns:
    cond: Returns an op that evaluates to true if environment_steps
      equals 0 mod n. We increment the step before checking this condition, so
      we do not need to add one to environment_steps.
  """
  del agent, state, action, transition_type, num_episodes
  cond = tf.equal(tf.mod(environment_steps, n), 0)
  return cond


@gin.configurable
def every_n_episodes(agent,
                     state,
                     action,
                     transition_type,
                     environment_steps,
                     num_episodes,
                     n=2,
                     steps_per_episode=None):
  """True once every n episodes.

  Specifically, evaluates to True on the 0th step of every nth episode.
  Unlike environment_steps, num_episodes starts at 0, so we do want to add
  one to ensure it does not reset on the first call.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    n: Return true once every n episodes.
    steps_per_episode: How many steps per episode. Needed to determine when a
      new episode starts.
  Returns:
    cond: Returns an op that evaluates to true on the last step of the episode
      (i.e. if num_episodes equals 0 mod n).
  """
  assert steps_per_episode is not None
  del agent, action, transition_type
  ant_fell = tf.logical_or(state[2] < 0.2, state[2] > 1.0)
  cond = tf.logical_and(
      tf.logical_or(
          ant_fell,
          tf.equal(tf.mod(num_episodes + 1, n), 0)),
      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
  return cond


@gin.configurable
def failed_reset_after_n_episodes(agent,
                                  state,
                                  action,
                                  transition_type,
                                  environment_steps,
                                  num_episodes,
                                  steps_per_episode=None,
                                  reset_state=None,
                                  max_dist=1.0,
                                  epsilon=1e-10):
  """Every n episodes, returns True if the reset agent fails to return.

  Specifically, evaluates to True if the distance between the state and the
  reset state is greater than max_dist at the end of the episode.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    steps_per_episode: How many steps per episode. Needed to determine when a
      new episode starts.
    reset_state: State to which the reset controller should return.
    max_dist: Agent is considered to have successfully reset if its distance
      from the reset_state is less than max_dist.
    epsilon: small offset to ensure non-negative/zero distance.
  Returns:
    cond: Returns an op that evaluates to true if num_episodes+1 equals 0
      mod n. We add one to the num_episodes so the environment is not reset
      after the 0th step.
  """
  assert steps_per_episode is not None
  assert reset_state is not None
  # `state` is used below to measure the distance to reset_state, so it is
  # not deleted here.
  del agent, action, transition_type, num_episodes
  dist = tf.sqrt(
      tf.reduce_sum(tf.squared_difference(state, reset_state)) + epsilon)
  cond = tf.logical_and(
      tf.greater(dist, tf.constant(max_dist)),
      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
  return cond


@gin.configurable
def q_too_small(agent,
                state,
                action,
                transition_type,
                environment_steps,
                num_episodes,
                q_min=0.5):
  """True if q is too small.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    q_min: Returns true if the qval is less than q_min
  Returns:
    cond: Returns an op that evaluates to true if qval is less than q_min.
  """
  del transition_type, environment_steps, num_episodes
  state_for_reset_agent = tf.stack(state[:-1],
                                   tf.constant([0], dtype=tf.float32))
  qval = agent.BASE_AGENT_CLASS.critic_net(
      tf.expand_dims(state_for_reset_agent, 0),
      tf.expand_dims(action, 0))[0, :]
  cond = tf.greater(tf.constant(q_min), qval)
  return cond


@gin.configurable
def true_fn(agent, state, action, transition_type, environment_steps,
            num_episodes):
  """Returns an op that evaluates to true.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: op that always evaluates to True.
  """
  del agent, state, action, transition_type, environment_steps, num_episodes
  cond = tf.constant(True, dtype=tf.bool)
  return cond


@gin.configurable
def false_fn(agent, state, action, transition_type, environment_steps,
             num_episodes):
  """Returns an op that evaluates to false.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: op that always evaluates to False.
  """
  del agent, state, action, transition_type, environment_steps, num_episodes
  cond = tf.constant(False, dtype=tf.bool)
  return cond
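
A hedged illustration of how one of these condition functions might be bound through Gin and wired into a graph (not part of this commit; the binding value is an arbitrary example):

```
import gin.tf
import tensorflow as tf

from cond_fn import every_n_steps  # path within research/efficient-hrl

# Rebind the period; callers that do not pass n will then get 500
# instead of the default 150.
gin.parse_config('every_n_steps.n = 500')

environment_steps = tf.placeholder(tf.int64, [], name='environment_steps')
num_episodes = tf.placeholder(tf.int64, [], name='num_episodes')

# agent/state/action/transition_type are unused by this particular condition,
# so None is enough for the sketch.
should_end_episode = every_n_steps(
    agent=None, state=None, action=None, transition_type=None,
    environment_steps=environment_steps, num_episodes=num_episodes)
```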
research/efficient-hrl/configs/base_uvf.gin
0 → 100644
#-*-Python-*-
import gin.tf.external_configurables
create_maze_env.top_down_view = %IMAGES
## Create the agent
AGENT_CLASS = @UvfAgent
UvfAgent.tf_context = %CONTEXT
UvfAgent.actor_net = @agent/ddpg_actor_net
UvfAgent.critic_net = @agent/ddpg_critic_net
UvfAgent.dqda_clipping = 0.0
UvfAgent.td_errors_loss = @tf.losses.huber_loss
UvfAgent.target_q_clipping = %TARGET_Q_CLIPPING
# Create meta agent
META_CLASS = @MetaAgent
MetaAgent.tf_context = %META_CONTEXT
MetaAgent.sub_context = %CONTEXT
MetaAgent.actor_net = @meta/ddpg_actor_net
MetaAgent.critic_net = @meta/ddpg_critic_net
MetaAgent.dqda_clipping = 0.0
MetaAgent.td_errors_loss = @tf.losses.huber_loss
MetaAgent.target_q_clipping = %TARGET_Q_CLIPPING
# Create state preprocess
STATE_PREPROCESS_CLASS = @StatePreprocess
StatePreprocess.ndims = %SUBGOAL_DIM
state_preprocess_net.states_hidden_layers = (100, 100)
state_preprocess_net.num_output_dims = %SUBGOAL_DIM
state_preprocess_net.images = %IMAGES
action_embed_net.num_output_dims = %SUBGOAL_DIM
INVERSE_DYNAMICS_CLASS = @InverseDynamics
# actor_net
ACTOR_HIDDEN_SIZE_1 = 300
ACTOR_HIDDEN_SIZE_2 = 300
agent/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
agent/ddpg_actor_net.activation_fn = @tf.nn.relu
agent/ddpg_actor_net.zero_obs = %ZERO_OBS
agent/ddpg_actor_net.images = %IMAGES
meta/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
meta/ddpg_actor_net.activation_fn = @tf.nn.relu
meta/ddpg_actor_net.zero_obs = False
meta/ddpg_actor_net.images = %IMAGES
# critic_net
CRITIC_HIDDEN_SIZE_1 = 300
CRITIC_HIDDEN_SIZE_2 = 300
agent/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
agent/ddpg_critic_net.actions_hidden_layers = None
agent/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
agent/ddpg_critic_net.weight_decay = 0.0
agent/ddpg_critic_net.activation_fn = @tf.nn.relu
agent/ddpg_critic_net.zero_obs = %ZERO_OBS
agent/ddpg_critic_net.images = %IMAGES
meta/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
meta/ddpg_critic_net.actions_hidden_layers = None
meta/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
meta/ddpg_critic_net.weight_decay = 0.0
meta/ddpg_critic_net.activation_fn = @tf.nn.relu
meta/ddpg_critic_net.zero_obs = False
meta/ddpg_critic_net.images = %IMAGES
tf.losses.huber_loss.delta = 1.0
# Sample action
uvf_add_noise_fn.stddev = 1.0
meta_add_noise_fn.stddev = %META_EXPLORE_NOISE
# Update targets
ddpg_update_targets.tau = 0.001
td3_update_targets.tau = 0.005
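
base_uvf.gin refers to macros such as %CONTEXT, %SUBGOAL_DIM, and %IMAGES that are supplied by the task configs under context/configs/ and by the launcher scripts. A hedged sketch of how such files might be composed with Gin follows; the repository's scripts/local_train.py normally does this wiring, and the exact file set and bindings below are illustrative assumptions:

```
import gin.tf

# Compose the shared agent config, the training config, and one task config,
# then fill in remaining macros with explicit bindings (values here are
# illustrative, not the repository's defaults; additional macros such as
# CONTEXT_RANGE_MIN/MAX may also be required).
gin.parse_config_files_and_bindings(
    config_files=[
        'configs/base_uvf.gin',
        'configs/train_uvf.gin',
        'context/configs/ant_maze.gin',
    ],
    bindings=[
        'SUBGOAL_DIM = 2',
        'IMAGES = False',
    ],
)
```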
research/efficient-hrl/configs/eval_uvf.gin
0 → 100644
#-*-Python-*-
# Config eval
evaluate.environment = @create_maze_env()
evaluate.agent_class = %AGENT_CLASS
evaluate.meta_agent_class = %META_CLASS
evaluate.state_preprocess_class = %STATE_PREPROCESS_CLASS
evaluate.num_episodes_eval = 50
evaluate.num_episodes_videos = 1
evaluate.gamma = 1.0
evaluate.eval_interval_secs = 1
evaluate.generate_videos = False
evaluate.generate_summaries = True
evaluate.eval_modes = %EVAL_MODES
evaluate.max_steps_per_episode = %RESET_EPISODE_PERIOD
research/efficient-hrl/configs/train_uvf.gin
0 → 100644
#-*-Python-*-
# Create replay_buffer
agent/CircularBuffer.buffer_size = 200000
meta/CircularBuffer.buffer_size = 200000
agent/CircularBuffer.scope = "agent"
meta/CircularBuffer.scope = "meta"
# Config train
train_uvf.environment = @create_maze_env()
train_uvf.agent_class = %AGENT_CLASS
train_uvf.meta_agent_class = %META_CLASS
train_uvf.state_preprocess_class = %STATE_PREPROCESS_CLASS
train_uvf.inverse_dynamics_class = %INVERSE_DYNAMICS_CLASS
train_uvf.replay_buffer = @agent/CircularBuffer()
train_uvf.meta_replay_buffer = @meta/CircularBuffer()
train_uvf.critic_optimizer = @critic/AdamOptimizer()
train_uvf.actor_optimizer = @actor/AdamOptimizer()
train_uvf.meta_critic_optimizer = @meta_critic/AdamOptimizer()
train_uvf.meta_actor_optimizer = @meta_actor/AdamOptimizer()
train_uvf.repr_optimizer = @repr/AdamOptimizer()
train_uvf.num_episodes_train = 25000
train_uvf.batch_size = 100
train_uvf.initial_episodes = 5
train_uvf.gamma = 0.99
train_uvf.meta_gamma = 0.99
train_uvf.reward_scale_factor = 1.0
train_uvf.target_update_period = 2
train_uvf.num_updates_per_observation = 1
train_uvf.num_collect_per_update = 1
train_uvf.num_collect_per_meta_update = 10
train_uvf.debug_summaries = False
train_uvf.log_every_n_steps = 1000
train_uvf.save_policy_every_n_steps =100000
# Config Optimizers
critic/AdamOptimizer.learning_rate = 0.001
critic/AdamOptimizer.beta1 = 0.9
critic/AdamOptimizer.beta2 = 0.999
actor/AdamOptimizer.learning_rate = 0.0001
actor/AdamOptimizer.beta1 = 0.9
actor/AdamOptimizer.beta2 = 0.999
meta_critic/AdamOptimizer.learning_rate = 0.001
meta_critic/AdamOptimizer.beta1 = 0.9
meta_critic/AdamOptimizer.beta2 = 0.999
meta_actor/AdamOptimizer.learning_rate = 0.0001
meta_actor/AdamOptimizer.beta1 = 0.9
meta_actor/AdamOptimizer.beta2 = 0.999
repr/AdamOptimizer.learning_rate = 0.0001
repr/AdamOptimizer.beta1 = 0.9
repr/AdamOptimizer.beta2 = 0.999
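
Individual hyperparameters declared above can also be overridden with extra Gin bindings rather than by editing the file. A hedged example (the values are arbitrary, not recommendations):

```
import gin.tf

# Override two of the train_uvf.gin settings after the config files are parsed.
gin.parse_config("""
agent/CircularBuffer.buffer_size = 500000
critic/AdamOptimizer.learning_rate = 3e-4
""")
```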
research/efficient-hrl/context/__init__.py
0 → 100644
research/efficient-hrl/context/configs/ant_block.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntBlock"
ZERO_OBS = False
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [3, 4]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_block_maze.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntBlockMaze"
ZERO_OBS = False
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (12, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [3, 4]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [8, 0]
eval2/ConstantSampler.value = [8, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_fall_multi.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntFall"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 4.5]
research/efficient-hrl/context/configs/ant_fall_multi_img.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntFall"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 3
task/relative_context_multi_transition_fn.k = 3
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 0]
research/efficient-hrl/context/configs/ant_fall_single.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntFall"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@eval1/ConstantSampler],
"explore": [@eval1/ConstantSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 4.5]
research/efficient-hrl/context/configs/ant_maze.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntMaze"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_maze_img.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntMaze"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 2
task/relative_context_multi_transition_fn.k = 2
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_push_multi.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntPush"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-16, -4), (16, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval2"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval2": [@eval2/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval2/ConstantSampler.value = [0, 19]
research/efficient-hrl/context/configs/ant_push_multi_img.gin
0 → 100644
#-*-Python-*-
create_maze_env.env_name = "AntPush"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-16, -4), (16, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval2"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval2": [@eval2/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 2
task/relative_context_multi_transition_fn.k = 2
MetaAgent.k = %SUBGOAL_DIM
eval2/ConstantSampler.value = [0, 19]