052361de
Commit
052361de
authored
Dec 05, 2018
by
ofirnachum
Browse files
add training code
parent
9b969ca5
Changes: 51 files in this commit; showing 20 changed files with 2968 additions and 8 deletions.

research/efficient-hrl/README.md (+42, -8)
research/efficient-hrl/agent.py (+774, -0)
research/efficient-hrl/agents/__init__.py (+1, -0)
research/efficient-hrl/agents/circular_buffer.py (+289, -0)
research/efficient-hrl/agents/ddpg_agent.py (+739, -0)
research/efficient-hrl/agents/ddpg_networks.py (+150, -0)
research/efficient-hrl/cond_fn.py (+244, -0)
research/efficient-hrl/configs/base_uvf.gin (+68, -0)
research/efficient-hrl/configs/eval_uvf.gin (+14, -0)
research/efficient-hrl/configs/train_uvf.gin (+52, -0)
research/efficient-hrl/context/__init__.py (+1, -0)
research/efficient-hrl/context/configs/ant_block.gin (+67, -0)
research/efficient-hrl/context/configs/ant_block_maze.gin (+67, -0)
research/efficient-hrl/context/configs/ant_fall_multi.gin (+62, -0)
research/efficient-hrl/context/configs/ant_fall_multi_img.gin (+68, -0)
research/efficient-hrl/context/configs/ant_fall_single.gin (+62, -0)
research/efficient-hrl/context/configs/ant_maze.gin (+66, -0)
research/efficient-hrl/context/configs/ant_maze_img.gin (+72, -0)
research/efficient-hrl/context/configs/ant_push_multi.gin (+62, -0)
research/efficient-hrl/context/configs/ant_push_multi_img.gin (+68, -0)
research/efficient-hrl/README.md

Code for performing Hierarchical RL based on the following publications:

"Data-Efficient Hierarchical Reinforcement Learning" by
Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
(https://arxiv.org/abs/1805.08296).

"Near-Optimal Representation Learning for Hierarchical Reinforcement Learning"
by Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine
(https://arxiv.org/abs/1810.01257).

Requirements:
* TensorFlow (see http://www.tensorflow.org for how to install/upgrade)
* Gin Config (see https://github.com/google/gin-config)
* Tensorflow Agents (see https://github.com/tensorflow/agents)
* OpenAI Gym (see http://gym.openai.com/docs, be sure to install MuJoCo as well)
* NumPy (see http://www.numpy.org/)

Quick Start:

Run a training job based on the original HIRO paper on Ant Maze:

```
python scripts/local_train.py test1 hiro_orig ant_maze base_uvf suite
```

Run a continuous evaluation job for that experiment:

```
python scripts/local_eval.py test1 hiro_orig ant_maze base_uvf suite
```

To run the same experiment with online representation learning (the
"Near-Optimal" paper), change `hiro_orig` to `hiro_repr`.
You can also run with `hiro_xy` to run the same experiment with HIRO on only the
xy coordinates of the agent.

To run on other environments, change `ant_maze` to something else; e.g.,
`ant_push_multi`, `ant_fall_multi`, etc. See `context/configs/*` for other
options.

Basic Code Guide:

The code for training resides in train.py. The code trains a lower-level policy
(a UVF agent in the code) and a higher-level policy (a MetaAgent in the code)
concurrently. The higher-level policy communicates goals to the lower-level
policy. In the code, this is called a context. Not only does the lower-level
policy act with respect to a context (a higher-level specified goal), but the
higher-level policy also acts with respect to an environment-specified context
(corresponding to the navigation target location associated with the task).
Therefore, in `context/configs/*` you will find both specifications for task
setup as well as goal configurations. Most remaining hyperparameters used for
training/evaluation may be found in `configs/*`.

NOTE: Not all the code corresponding to the "Near-Optimal" paper is included.
Namely, changes to low-level policy training proposed in the paper (discounting
and auxiliary rewards) are not implemented here. Performance should not change
significantly.

Maintained by Ofir Nachum (ofirnachum).
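As a rough illustration of the goal-context idea described in the Basic Code Guide (not code from this commit): in the HIRO setup the lower-level policy is rewarded for moving selected state coordinates toward the goal supplied by the higher-level policy, which the configs below express as a `negative_distance` reward over `state_indices`. A minimal NumPy sketch with hypothetical names:

```
import numpy as np

def negative_distance_reward(state, goal, state_indices=(0, 1)):
  """Illustrative lower-level reward: negative L2 distance between the
  selected state coordinates and the goal set by the higher-level policy."""
  achieved = np.asarray(state, dtype=np.float64)[list(state_indices)]
  return -np.linalg.norm(achieved - np.asarray(goal, dtype=np.float64))

# Example: the meta-policy asked the agent to reach xy = (3.0, 4.0).
print(negative_distance_reward(np.zeros(30), goal=(3.0, 4.0)))  # -5.0
```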
research/efficient-hrl/agent.py (new file, mode 100644; diff collapsed, contents not shown)
research/efficient-hrl/agents/__init__.py (new file, mode 100644)
research/efficient-hrl/agents/circular_buffer.py (new file, mode 100644)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A circular buffer where each element is a list of tensors.
Each element of the buffer is a list of tensors. An example use case is a replay
buffer in reinforcement learning, where each element is a list of tensors
representing the state, action, reward etc.
New elements are added sequentially, and once the buffer is full, we
start overwriting them in a circular fashion. Reading does not remove any
elements, only adding new elements does.
"""
import collections
import numpy as np
import tensorflow as tf

import gin.tf


@gin.configurable
class CircularBuffer(object):
  """A circular buffer where each element is a list of tensors."""

  def __init__(self, buffer_size=1000, scope='replay_buffer'):
    """Circular buffer of list of tensors.

    Args:
      buffer_size: (integer) maximum number of tensor lists the buffer can hold.
      scope: (string) variable scope for creating the variables.
    """
    self._buffer_size = np.int64(buffer_size)
    self._scope = scope
    self._tensors = collections.OrderedDict()
    with tf.variable_scope(self._scope):
      self._num_adds = tf.Variable(0, dtype=tf.int64, name='num_adds')
      self._num_adds_cs = tf.contrib.framework.CriticalSection(name='num_adds')

  @property
  def buffer_size(self):
    return self._buffer_size

  @property
  def scope(self):
    return self._scope

  @property
  def num_adds(self):
    return self._num_adds

  def _create_variables(self, tensors):
    with tf.variable_scope(self._scope):
      for name in tensors.keys():
        tensor = tensors[name]
        self._tensors[name] = tf.get_variable(
            name='BufferVariable_' + name,
            shape=[self._buffer_size] + tensor.get_shape().as_list(),
            dtype=tensor.dtype,
            trainable=False)

  def _validate(self, tensors):
    """Validate shapes of tensors."""
    if len(tensors) != len(self._tensors):
      raise ValueError('Expected tensors to have %d elements. Received %d '
                       'instead.' % (len(self._tensors), len(tensors)))
    if self._tensors.keys() != tensors.keys():
      raise ValueError('The keys of tensors should always be the same. '
                       'Received %s instead of %s.' %
                       (tensors.keys(), self._tensors.keys()))
    for name, tensor in tensors.items():
      if tensor.get_shape().as_list() != self._tensors[
          name].get_shape().as_list()[1:]:
        raise ValueError('Tensor %s has incorrect shape.' % name)
      if not tensor.dtype.is_compatible_with(self._tensors[name].dtype):
        raise ValueError(
            'Tensor %s has incorrect data type. Expected %s, received %s' %
            (name, self._tensors[name].read_value().dtype, tensor.dtype))

  def add(self, tensors):
    """Adds an element (list/tuple/dict of tensors) to the buffer.

    Args:
      tensors: (list/tuple/dict of tensors) to be added to the buffer.
    Returns:
      An add operation that adds the input `tensors` to the buffer. Similar to
        an enqueue_op.
    Raises:
      ValueError: If the shapes and data types of input `tensors` are not the
        same across calls to the add function.
    """
    return self.maybe_add(tensors, True)

  def maybe_add(self, tensors, condition):
    """Adds an element (tensors) to the buffer based on the condition.

    Args:
      tensors: (list/tuple of tensors) to be added to the buffer.
      condition: A boolean Tensor controlling whether the tensors would be added
        to the buffer or not.
    Returns:
      An add operation that adds the input `tensors` to the buffer. Similar to
        a maybe_enqueue_op.
    Raises:
      ValueError: If the shapes and data types of input `tensors` are not the
        same across calls to the add function.
    """
    if not isinstance(tensors, dict):
      names = [str(i) for i in range(len(tensors))]
      tensors = collections.OrderedDict(zip(names, tensors))
    if not isinstance(tensors, collections.OrderedDict):
      tensors = collections.OrderedDict(
          sorted(tensors.items(), key=lambda t: t[0]))
    if not self._tensors:
      self._create_variables(tensors)
    else:
      self._validate(tensors)

    #@tf.critical_section(self._position_mutex)
    def _increment_num_adds():
      # Adding 0 to the num_adds variable is a trick to read the value of the
      # variable and return a read-only tensor. Doing this in a critical
      # section allows us to capture a snapshot of the variable that will
      # not be affected by other threads updating num_adds.
      return self._num_adds.assign_add(1) + 0

    def _add():
      num_adds_inc = self._num_adds_cs.execute(_increment_num_adds)
      current_pos = tf.mod(num_adds_inc - 1, self._buffer_size)
      update_ops = []
      for name in self._tensors.keys():
        update_ops.append(
            tf.scatter_update(self._tensors[name], current_pos, tensors[name]))
      return tf.group(*update_ops)

    return tf.contrib.framework.smart_cond(condition, _add, tf.no_op)

  def get_random_batch(self, batch_size, keys=None, num_steps=1):
    """Samples a batch of tensors from the buffer with replacement.

    Args:
      batch_size: (integer) number of elements to sample.
      keys: List of keys of tensors to retrieve. If None retrieve all.
      num_steps: (integer) length of trajectories to return. If > 1 will return
        a list of lists, where each internal list represents a trajectory of
        length num_steps.
    Returns:
      A list of tensors, where each element in the list is a batch sampled from
        one of the tensors in the buffer.
    Raises:
      ValueError: If get_random_batch is called before calling the add function.
      tf.errors.InvalidArgumentError: If this operation is executed before any
        items are added to the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before '
                       'get_random_batch.')
    if keys is None:
      keys = self._tensors.keys()

    latest_start_index = self.get_num_adds() - num_steps + 1
    empty_buffer_assert = tf.Assert(
        tf.greater(latest_start_index, 0),
        ['Not enough elements have been added to the buffer.'])
    with tf.control_dependencies([empty_buffer_assert]):
      max_index = tf.minimum(self._buffer_size, latest_start_index)
      indices = tf.random_uniform(
          [batch_size],
          minval=0,
          maxval=max_index,
          dtype=tf.int64)
      if num_steps == 1:
        return self.gather(indices, keys)
      else:
        return self.gather_nstep(num_steps, indices, keys)

  def gather(self, indices, keys=None):
    """Returns elements at the specified indices from the buffer.

    Args:
      indices: (list of integers or rank 1 int Tensor) indices in the buffer to
        retrieve elements from.
      keys: List of keys of tensors to retrieve. If None retrieve all.
    Returns:
      A list of tensors, where each element in the list is obtained by indexing
        one of the tensors in the buffer.
    Raises:
      ValueError: If gather is called before calling the add function.
      tf.errors.InvalidArgumentError: If indices are bigger than the number of
        items in the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before calling gather.')
    if keys is None:
      keys = self._tensors.keys()
    with tf.name_scope('Gather'):
      index_bound_assert = tf.Assert(
          tf.less(
              tf.to_int64(tf.reduce_max(indices)),
              tf.minimum(self.get_num_adds(), self._buffer_size)),
          ['Index out of bounds.'])
      with tf.control_dependencies([index_bound_assert]):
        indices = tf.convert_to_tensor(indices)

      batch = []
      for key in keys:
        batch.append(tf.gather(self._tensors[key], indices, name=key))
      return batch

  def gather_nstep(self, num_steps, indices, keys=None):
    """Returns elements at the specified indices from the buffer.

    Args:
      num_steps: (integer) length of trajectories to return.
      indices: (list of rank num_steps int Tensor) indices in the buffer to
        retrieve elements from for multiple trajectories. Each Tensor in the
        list represents the indices for a trajectory.
      keys: List of keys of tensors to retrieve. If None retrieve all.
    Returns:
      A list of list-of-tensors, where each element in the list is obtained by
        indexing one of the tensors in the buffer.
    Raises:
      ValueError: If gather is called before calling the add function.
      tf.errors.InvalidArgumentError: If indices are bigger than the number of
        items in the buffer.
    """
    if not self._tensors:
      raise ValueError('The add function must be called before calling gather.')
    if keys is None:
      keys = self._tensors.keys()
    with tf.name_scope('Gather'):
      index_bound_assert = tf.Assert(
          tf.less_equal(
              tf.to_int64(tf.reduce_max(indices) + num_steps),
              self.get_num_adds()),
          ['Trajectory indices go out of bounds.'])
      with tf.control_dependencies([index_bound_assert]):
        indices = tf.map_fn(
            lambda x: tf.mod(tf.range(x, x + num_steps), self._buffer_size),
            indices,
            dtype=tf.int64)

      batch = []
      for key in keys:

        def SampleTrajectories(trajectory_indices, key=key,
                               num_steps=num_steps):
          trajectory_indices.set_shape([num_steps])
          return tf.gather(self._tensors[key], trajectory_indices, name=key)

        batch.append(tf.map_fn(SampleTrajectories, indices,
                               dtype=self._tensors[key].dtype))
      return batch

  def get_position(self):
    """Returns the position at which the last element was added.

    Returns:
      An int tensor representing the index at which the last element was added
        to the buffer or -1 if no elements were added.
    """
    return tf.cond(self.get_num_adds() < 1,
                   lambda: self.get_num_adds() - 1,
                   lambda: tf.mod(self.get_num_adds() - 1, self._buffer_size))

  def get_num_adds(self):
    """Returns the number of additions to the buffer.

    Returns:
      An int tensor representing the number of elements that were added.
    """
    def num_adds():
      return self._num_adds.value()

    return self._num_adds_cs.execute(num_adds)

  def get_num_tensors(self):
    """Returns the number of tensors (slots) in the buffer."""
    return len(self._tensors)
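For orientation, here is a minimal sketch of how the buffer above might be wired into a TF 1.x graph. The placeholder shapes, key names, and import path are assumptions for illustration, not part of this commit:

```
import numpy as np
import tensorflow as tf
from agents.circular_buffer import CircularBuffer  # path assumed

# Placeholders standing in for one transition; shapes are illustrative.
state = tf.placeholder(tf.float32, [30], name='state')
action = tf.placeholder(tf.float32, [8], name='action')
reward = tf.placeholder(tf.float32, [], name='reward')

buf = CircularBuffer(buffer_size=1000, scope='example_buffer')
# The first add() call creates the underlying buffer variables from the
# shapes/dtypes of the tensors it receives.
add_op = buf.add({'state': state, 'action': action, 'reward': reward})
# Sampling ops can be built once those variables exist.
sample_op = buf.get_random_batch(batch_size=64)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(200):
    sess.run(add_op, feed_dict={state: np.random.randn(30),
                                action: np.random.randn(8),
                                reward: 0.0})
  # Returns a list of arrays, one per key (sorted), each with leading dim 64.
  batch = sess.run(sample_op)
```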
research/efficient-hrl/agents/ddpg_agent.py (new file, mode 100644; diff collapsed, contents not shown)
research/efficient-hrl/agents/ddpg_networks.py (new file, mode 100644)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Sample actor(policy) and critic(q) networks to use with DDPG/NAF agents.
The DDPG networks are defined in "Section 7: Experiment Details" of
"Continuous control with deep reinforcement learning" - Lilicrap et al.
https://arxiv.org/abs/1509.02971
The NAF critic network is based on "Section 4" of "Continuous deep Q-learning
with model-based acceleration" - Gu et al. https://arxiv.org/pdf/1603.00748.
"""
import tensorflow as tf
slim = tf.contrib.slim

import gin.tf


@gin.configurable('ddpg_critic_net')
def critic_net(states, actions,
               for_critic_loss=False,
               num_reward_dims=1,
               states_hidden_layers=(400,),
               actions_hidden_layers=None,
               joint_hidden_layers=(300,),
               weight_decay=0.0001,
               normalizer_fn=None,
               activation_fn=tf.nn.relu,
               zero_obs=False,
               images=False):
  """Creates a critic that returns q values for the given states and actions.

  Args:
    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
      representing a batch of states.
    actions: (castable to tf.float32) a [batch_size, num_action_dims] tensor
      representing a batch of actions.
    num_reward_dims: Number of reward dimensions.
    states_hidden_layers: tuple of hidden layers units for states.
    actions_hidden_layers: tuple of hidden layers units for actions.
    joint_hidden_layers: tuple of hidden layers units after joining states
      and actions using tf.concat().
    weight_decay: Weight decay for l2 weights regularizer.
    normalizer_fn: Normalizer function, i.e. slim.layer_norm,
    activation_fn: Activation function, i.e. tf.nn.relu, slim.leaky_relu, ...
  Returns:
    A tf.float32 [batch_size] tensor of q values, or a tf.float32
      [batch_size, num_reward_dims] tensor of vector q values if
      num_reward_dims > 1.
  """
  with slim.arg_scope(
      [slim.fully_connected],
      activation_fn=activation_fn,
      normalizer_fn=normalizer_fn,
      weights_regularizer=slim.l2_regularizer(weight_decay),
      weights_initializer=slim.variance_scaling_initializer(
          factor=1.0/3.0, mode='FAN_IN', uniform=True)):

    orig_states = tf.to_float(states)
    #states = tf.to_float(states)
    states = tf.concat([tf.to_float(states), tf.to_float(actions)], -1)  #TD3
    if images or zero_obs:
      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))  #LALA
    actions = tf.to_float(actions)
    if states_hidden_layers:
      states = slim.stack(states, slim.fully_connected, states_hidden_layers,
                          scope='states')
    if actions_hidden_layers:
      actions = slim.stack(actions, slim.fully_connected,
                           actions_hidden_layers, scope='actions')
    joint = tf.concat([states, actions], 1)
    if joint_hidden_layers:
      joint = slim.stack(joint, slim.fully_connected, joint_hidden_layers,
                         scope='joint')
    with slim.arg_scope([slim.fully_connected],
                        weights_regularizer=None,
                        weights_initializer=tf.random_uniform_initializer(
                            minval=-0.003, maxval=0.003)):
      value = slim.fully_connected(joint, num_reward_dims,
                                   activation_fn=None,
                                   normalizer_fn=None,
                                   scope='q_value')
    if num_reward_dims == 1:
      value = tf.reshape(value, [-1])
    if not for_critic_loss and num_reward_dims > 1:
      value = tf.reduce_sum(
          value * tf.abs(orig_states[:, -num_reward_dims:]), -1)
  return value


@gin.configurable('ddpg_actor_net')
def actor_net(states, action_spec,
              hidden_layers=(400, 300),
              normalizer_fn=None,
              activation_fn=tf.nn.relu,
              zero_obs=False,
              images=False):
  """Creates an actor that returns actions for the given states.

  Args:
    states: (castable to tf.float32) a [batch_size, num_state_dims] tensor
      representing a batch of states.
    action_spec: (BoundedTensorSpec) A tensor spec indicating the shape
      and range of actions.
    hidden_layers: tuple of hidden layers units.
    normalizer_fn: Normalizer function, i.e. slim.layer_norm,
    activation_fn: Activation function, i.e. tf.nn.relu, slim.leaky_relu, ...
  Returns:
    A tf.float32 [batch_size, num_action_dims] tensor of actions.
  """
  with slim.arg_scope(
      [slim.fully_connected],
      activation_fn=activation_fn,
      normalizer_fn=normalizer_fn,
      weights_initializer=slim.variance_scaling_initializer(
          factor=1.0/3.0, mode='FAN_IN', uniform=True)):

    states = tf.to_float(states)
    orig_states = states
    if images or zero_obs:
      # Zero-out x, y position. Hacky.
      states *= tf.constant([0.0] * 2 + [1.0] * (states.shape[1] - 2))
    if hidden_layers:
      states = slim.stack(states, slim.fully_connected, hidden_layers,
                          scope='states')
    with slim.arg_scope([slim.fully_connected],
                        weights_initializer=tf.random_uniform_initializer(
                            minval=-0.003, maxval=0.003)):
      actions = slim.fully_connected(states,
                                     action_spec.shape.num_elements(),
                                     scope='actions',
                                     normalizer_fn=None,
                                     activation_fn=tf.nn.tanh)
      action_means = (action_spec.maximum + action_spec.minimum) / 2.0
      action_magnitudes = (action_spec.maximum - action_spec.minimum) / 2.0
      actions = action_means + action_magnitudes * actions

  return actions
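A small sketch of how these two networks might be instantiated. The namedtuple stand-in for `action_spec` and the tensor shapes are illustrative assumptions; in the repo the spec comes from the agent/environment code, which is collapsed in this diff:

```
import collections
import tensorflow as tf

# Stand-in exposing only the attributes actor_net reads from a BoundedTensorSpec.
ActionSpec = collections.namedtuple('ActionSpec', ['shape', 'minimum', 'maximum'])
spec = ActionSpec(shape=tf.TensorShape([8]), minimum=-1.0, maximum=1.0)

states = tf.placeholder(tf.float32, [None, 30])
with tf.variable_scope('actor'):
  actions = actor_net(states, spec)       # [batch, 8], squashed into [-1, 1]
with tf.variable_scope('critic'):
  q_values = critic_net(states, actions)  # [batch]
```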
research/efficient-hrl/cond_fn.py (new file, mode 100644)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines many boolean functions indicating when to step and reset.
"""
import tensorflow as tf
import gin.tf


@gin.configurable
def env_transition(agent, state, action, transition_type, environment_steps,
                   num_episodes):
  """True if the transition_type is TRANSITION or FINAL_TRANSITION.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: Returns an op that evaluates to true if the transition type is
      not RESTARTING.
  """
  del agent, state, action, num_episodes, environment_steps
  cond = tf.logical_not(transition_type)
  return cond


@gin.configurable
def env_restart(agent, state, action, transition_type, environment_steps,
                num_episodes):
  """True if the transition_type is RESTARTING.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: Returns an op that evaluates to true if the transition type equals
      RESTARTING.
  """
  del agent, state, action, num_episodes, environment_steps
  cond = tf.identity(transition_type)
  return cond


@gin.configurable
def every_n_steps(agent,
                  state,
                  action,
                  transition_type,
                  environment_steps,
                  num_episodes,
                  n=150):
  """True once every n steps.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    n: Return true once every n steps.
  Returns:
    cond: Returns an op that evaluates to true if environment_steps
      equals 0 mod n. We increment the step before checking this condition, so
      we do not need to add one to environment_steps.
  """
  del agent, state, action, transition_type, num_episodes
  cond = tf.equal(tf.mod(environment_steps, n), 0)
  return cond


@gin.configurable
def every_n_episodes(agent,
                     state,
                     action,
                     transition_type,
                     environment_steps,
                     num_episodes,
                     n=2,
                     steps_per_episode=None):
  """True once every n episodes.

  Specifically, evaluates to True on the 0th step of every nth episode.
  Unlike environment_steps, num_episodes starts at 0, so we do want to add
  one to ensure it does not reset on the first call.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    n: Return true once every n episodes.
    steps_per_episode: How many steps per episode. Needed to determine when a
      new episode starts.
  Returns:
    cond: Returns an op that evaluates to true on the last step of the episode
      (i.e. if num_episodes equals 0 mod n).
  """
  assert steps_per_episode is not None
  del agent, action, transition_type
  ant_fell = tf.logical_or(state[2] < 0.2, state[2] > 1.0)
  cond = tf.logical_and(
      tf.logical_or(
          ant_fell,
          tf.equal(tf.mod(num_episodes + 1, n), 0)),
      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
  return cond


@gin.configurable
def failed_reset_after_n_episodes(agent,
                                  state,
                                  action,
                                  transition_type,
                                  environment_steps,
                                  num_episodes,
                                  steps_per_episode=None,
                                  reset_state=None,
                                  max_dist=1.0,
                                  epsilon=1e-10):
  """Every n episodes, returns True if the reset agent fails to return.

  Specifically, evaluates to True if the distance between the state and the
  reset state is greater than max_dist at the end of the episode.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    steps_per_episode: How many steps per episode. Needed to determine when a
      new episode starts.
    reset_state: State to which the reset controller should return.
    max_dist: Agent is considered to have successfully reset if its distance
      from the reset_state is less than max_dist.
    epsilon: small offset to ensure non-negative/zero distance.
  Returns:
    cond: Returns an op that evaluates to true if num_episodes+1 equals 0
      mod n. We add one to the num_episodes so the environment is not reset
      after the 0th step.
  """
  assert steps_per_episode is not None
  assert reset_state is not None
  # Note: `state` is used below, so it must not be deleted here.
  del agent, action, transition_type, num_episodes
  dist = tf.sqrt(
      tf.reduce_sum(tf.squared_difference(state, reset_state)) + epsilon)
  cond = tf.logical_and(
      tf.greater(dist, tf.constant(max_dist)),
      tf.equal(tf.mod(environment_steps, steps_per_episode), 0))
  return cond


@gin.configurable
def q_too_small(agent,
                state,
                action,
                transition_type,
                environment_steps,
                num_episodes,
                q_min=0.5):
  """True if q is too small.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
    q_min: Returns true if the qval is less than q_min.
  Returns:
    cond: Returns an op that evaluates to true if qval is less than q_min.
  """
  del transition_type, environment_steps, num_episodes
  state_for_reset_agent = tf.stack(state[:-1],
                                   tf.constant([0], dtype=tf.float32))
  qval = agent.BASE_AGENT_CLASS.critic_net(
      tf.expand_dims(state_for_reset_agent, 0),
      tf.expand_dims(action, 0))[0, :]
  cond = tf.greater(tf.constant(q_min), qval)
  return cond


@gin.configurable
def true_fn(agent, state, action, transition_type, environment_steps,
            num_episodes):
  """Returns an op that evaluates to true.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: op that always evaluates to True.
  """
  del agent, state, action, transition_type, environment_steps, num_episodes
  cond = tf.constant(True, dtype=tf.bool)
  return cond


@gin.configurable
def false_fn(agent, state, action, transition_type, environment_steps,
             num_episodes):
  """Returns an op that evaluates to false.

  Args:
    agent: RL agent.
    state: A [num_state_dims] tensor representing a state.
    action: Action performed.
    transition_type: Type of transition after action.
    environment_steps: Number of steps performed by environment.
    num_episodes: Number of episodes.
  Returns:
    cond: op that always evaluates to False.
  """
  del agent, state, action, transition_type, environment_steps, num_episodes
  cond = tf.constant(False, dtype=tf.bool)
  return cond
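As a quick illustration of how one of these condition ops behaves (a hedged sketch, not from the commit): `every_n_steps` only looks at `environment_steps`, so the unused arguments can be anything:

```
import tensorflow as tf

environment_steps = tf.placeholder(tf.int64, [])
cond = every_n_steps(agent=None, state=None, action=None, transition_type=None,
                     environment_steps=environment_steps, num_episodes=None,
                     n=150)

with tf.Session() as sess:
  print(sess.run(cond, {environment_steps: 300}))  # True  (300 % 150 == 0)
  print(sess.run(cond, {environment_steps: 301}))  # False
```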
research/efficient-hrl/configs/base_uvf.gin (new file, mode 100644)
#-*-Python-*-
import gin.tf.external_configurables
create_maze_env.top_down_view = %IMAGES
## Create the agent
AGENT_CLASS = @UvfAgent
UvfAgent.tf_context = %CONTEXT
UvfAgent.actor_net = @agent/ddpg_actor_net
UvfAgent.critic_net = @agent/ddpg_critic_net
UvfAgent.dqda_clipping = 0.0
UvfAgent.td_errors_loss = @tf.losses.huber_loss
UvfAgent.target_q_clipping = %TARGET_Q_CLIPPING
# Create meta agent
META_CLASS = @MetaAgent
MetaAgent.tf_context = %META_CONTEXT
MetaAgent.sub_context = %CONTEXT
MetaAgent.actor_net = @meta/ddpg_actor_net
MetaAgent.critic_net = @meta/ddpg_critic_net
MetaAgent.dqda_clipping = 0.0
MetaAgent.td_errors_loss = @tf.losses.huber_loss
MetaAgent.target_q_clipping = %TARGET_Q_CLIPPING
# Create state preprocess
STATE_PREPROCESS_CLASS = @StatePreprocess
StatePreprocess.ndims = %SUBGOAL_DIM
state_preprocess_net.states_hidden_layers = (100, 100)
state_preprocess_net.num_output_dims = %SUBGOAL_DIM
state_preprocess_net.images = %IMAGES
action_embed_net.num_output_dims = %SUBGOAL_DIM
INVERSE_DYNAMICS_CLASS = @InverseDynamics
# actor_net
ACTOR_HIDDEN_SIZE_1 = 300
ACTOR_HIDDEN_SIZE_2 = 300
agent/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
agent/ddpg_actor_net.activation_fn = @tf.nn.relu
agent/ddpg_actor_net.zero_obs = %ZERO_OBS
agent/ddpg_actor_net.images = %IMAGES
meta/ddpg_actor_net.hidden_layers = (%ACTOR_HIDDEN_SIZE_1, %ACTOR_HIDDEN_SIZE_2)
meta/ddpg_actor_net.activation_fn = @tf.nn.relu
meta/ddpg_actor_net.zero_obs = False
meta/ddpg_actor_net.images = %IMAGES
# critic_net
CRITIC_HIDDEN_SIZE_1 = 300
CRITIC_HIDDEN_SIZE_2 = 300
agent/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
agent/ddpg_critic_net.actions_hidden_layers = None
agent/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
agent/ddpg_critic_net.weight_decay = 0.0
agent/ddpg_critic_net.activation_fn = @tf.nn.relu
agent/ddpg_critic_net.zero_obs = %ZERO_OBS
agent/ddpg_critic_net.images = %IMAGES
meta/ddpg_critic_net.states_hidden_layers = (%CRITIC_HIDDEN_SIZE_1,)
meta/ddpg_critic_net.actions_hidden_layers = None
meta/ddpg_critic_net.joint_hidden_layers = (%CRITIC_HIDDEN_SIZE_2,)
meta/ddpg_critic_net.weight_decay = 0.0
meta/ddpg_critic_net.activation_fn = @tf.nn.relu
meta/ddpg_critic_net.zero_obs = False
meta/ddpg_critic_net.images = %IMAGES
tf.losses.huber_loss.delta = 1.0
# Sample action
uvf_add_noise_fn.stddev = 1.0
meta_add_noise_fn.stddev = %META_EXPLORE_NOISE
# Update targets
ddpg_update_targets.tau = 0.001
td3_update_targets.tau = 0.005
research/efficient-hrl/configs/eval_uvf.gin (new file, mode 100644)
#-*-Python-*-
# Config eval
evaluate.environment = @create_maze_env()
evaluate.agent_class = %AGENT_CLASS
evaluate.meta_agent_class = %META_CLASS
evaluate.state_preprocess_class = %STATE_PREPROCESS_CLASS
evaluate.num_episodes_eval = 50
evaluate.num_episodes_videos = 1
evaluate.gamma = 1.0
evaluate.eval_interval_secs = 1
evaluate.generate_videos = False
evaluate.generate_summaries = True
evaluate.eval_modes = %EVAL_MODES
evaluate.max_steps_per_episode = %RESET_EPISODE_PERIOD
research/efficient-hrl/configs/train_uvf.gin (new file, mode 100644)
#-*-Python-*-
# Create replay_buffer
agent/CircularBuffer.buffer_size = 200000
meta/CircularBuffer.buffer_size = 200000
agent/CircularBuffer.scope = "agent"
meta/CircularBuffer.scope = "meta"
# Config train
train_uvf.environment = @create_maze_env()
train_uvf.agent_class = %AGENT_CLASS
train_uvf.meta_agent_class = %META_CLASS
train_uvf.state_preprocess_class = %STATE_PREPROCESS_CLASS
train_uvf.inverse_dynamics_class = %INVERSE_DYNAMICS_CLASS
train_uvf.replay_buffer = @agent/CircularBuffer()
train_uvf.meta_replay_buffer = @meta/CircularBuffer()
train_uvf.critic_optimizer = @critic/AdamOptimizer()
train_uvf.actor_optimizer = @actor/AdamOptimizer()
train_uvf.meta_critic_optimizer = @meta_critic/AdamOptimizer()
train_uvf.meta_actor_optimizer = @meta_actor/AdamOptimizer()
train_uvf.repr_optimizer = @repr/AdamOptimizer()
train_uvf.num_episodes_train = 25000
train_uvf.batch_size = 100
train_uvf.initial_episodes = 5
train_uvf.gamma = 0.99
train_uvf.meta_gamma = 0.99
train_uvf.reward_scale_factor = 1.0
train_uvf.target_update_period = 2
train_uvf.num_updates_per_observation = 1
train_uvf.num_collect_per_update = 1
train_uvf.num_collect_per_meta_update = 10
train_uvf.debug_summaries = False
train_uvf.log_every_n_steps = 1000
train_uvf.save_policy_every_n_steps = 100000
# Config Optimizers
critic/AdamOptimizer.learning_rate = 0.001
critic/AdamOptimizer.beta1 = 0.9
critic/AdamOptimizer.beta2 = 0.999
actor/AdamOptimizer.learning_rate = 0.0001
actor/AdamOptimizer.beta1 = 0.9
actor/AdamOptimizer.beta2 = 0.999
meta_critic/AdamOptimizer.learning_rate = 0.001
meta_critic/AdamOptimizer.beta1 = 0.9
meta_critic/AdamOptimizer.beta2 = 0.999
meta_actor/AdamOptimizer.learning_rate = 0.0001
meta_actor/AdamOptimizer.beta1 = 0.9
meta_actor/AdamOptimizer.beta2 = 0.999
repr/AdamOptimizer.learning_rate = 0.0001
repr/AdamOptimizer.beta1 = 0.9
repr/AdamOptimizer.beta2 = 0.999
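These config files are meant to be composed; the actual wiring lives in scripts/local_train.py and scripts/local_eval.py, which are not among the files shown on this page. A hedged sketch of how the composition could look with plain gin-config — the file list and macro values below are illustrative guesses, and the repo's configurables (UvfAgent, MetaAgent, create_maze_env, the samplers, etc.) must already be imported so gin can resolve the references:

```
import gin.tf

# Hypothetical composition: shared agent/training settings plus one
# environment-specific context config, with the remaining macros bound here.
gin.parse_config_files_and_bindings(
    config_files=[
        'configs/base_uvf.gin',
        'configs/train_uvf.gin',
        'context/configs/ant_maze.gin',
    ],
    bindings=[
        'SUBGOAL_DIM = 15',          # illustrative values only
        'CONTEXT_RANGE_MIN = -10',
        'CONTEXT_RANGE_MAX = 10',
        'IMAGES = False',
        'ZERO_OBS = False',
        'TARGET_Q_CLIPPING = None',
        'META_EXPLORE_NOISE = 1.0',
    ])
```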
research/efficient-hrl/context/__init__.py (new file, mode 100644)
research/efficient-hrl/context/configs/ant_block.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntBlock"
ZERO_OBS = False
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [3, 4]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_block_maze.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntBlockMaze"
ZERO_OBS = False
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (12, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [3, 4]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [8, 0]
eval2/ConstantSampler.value = [8, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_fall_multi.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntFall"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 4.5]
research/efficient-hrl/context/configs/ant_fall_multi_img.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntFall"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 3
task/relative_context_multi_transition_fn.k = 3
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 0]
research/efficient-hrl/context/configs/ant_fall_single.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntFall"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4, 0), (12, 28, 5))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [3]
meta/Context.samplers = {
"train": [@eval1/ConstantSampler],
"explore": [@eval1/ConstantSampler],
"eval1": [@eval1/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1, 2]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [0, 27, 4.5]
research/efficient-hrl/context/configs/ant_maze.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntMaze"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_maze_img.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntMaze"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-4, -4), (20, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval1", "eval2", "eval3"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval1": [@eval1/ConstantSampler],
"eval2": [@eval2/ConstantSampler],
"eval3": [@eval3/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 2
task/relative_context_multi_transition_fn.k = 2
MetaAgent.k = %SUBGOAL_DIM
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
research/efficient-hrl/context/configs/ant_push_multi.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntPush"
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-16, -4), (16, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval2"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval2": [@eval2/ConstantSampler],
}
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = False
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
MetaAgent.k = %SUBGOAL_DIM
eval2/ConstantSampler.value = [0, 19]
research/efficient-hrl/context/configs/ant_push_multi_img.gin (new file, mode 100644)
#-*-Python-*-
create_maze_env.env_name = "AntPush"
IMAGES = True
context_range = (%CONTEXT_RANGE_MIN, %CONTEXT_RANGE_MAX)
meta_context_range = ((-16, -4), (16, 20))
RESET_EPISODE_PERIOD = 500
RESET_ENV_PERIOD = 1
# End episode every N steps
UvfAgent.reset_episode_cond_fn = @every_n_steps
every_n_steps.n = %RESET_EPISODE_PERIOD
train_uvf.max_steps_per_episode = %RESET_EPISODE_PERIOD
# Do a manual reset every N episodes
UvfAgent.reset_env_cond_fn = @every_n_episodes
every_n_episodes.n = %RESET_ENV_PERIOD
every_n_episodes.steps_per_episode = %RESET_EPISODE_PERIOD
## Config defaults
EVAL_MODES = ["eval2"]
## Config agent
CONTEXT = @agent/Context
META_CONTEXT = @meta/Context
## Config agent context
agent/Context.context_ranges = [%context_range]
agent/Context.context_shapes = [%SUBGOAL_DIM]
agent/Context.meta_action_every_n = 10
agent/Context.samplers = {
"train": [@train/DirectionSampler],
"explore": [@train/DirectionSampler],
}
agent/Context.context_transition_fn = @relative_context_transition_fn
agent/Context.context_multi_transition_fn = @relative_context_multi_transition_fn
agent/Context.reward_fn = @uvf/negative_distance
## Config meta context
meta/Context.context_ranges = [%meta_context_range]
meta/Context.context_shapes = [2]
meta/Context.samplers = {
"train": [@train/RandomSampler],
"explore": [@train/RandomSampler],
"eval2": [@eval2/ConstantSampler],
}
meta/Context.context_transition_fn = @task/relative_context_transition_fn
meta/Context.context_multi_transition_fn = @task/relative_context_multi_transition_fn
meta/Context.reward_fn = @task/negative_distance
## Config rewards
task/negative_distance.state_indices = [0, 1]
task/negative_distance.relative_context = True
task/negative_distance.diff = False
task/negative_distance.offset = 0.0
## Config samplers
train/RandomSampler.context_range = %meta_context_range
train/DirectionSampler.context_range = %context_range
train/DirectionSampler.k = %SUBGOAL_DIM
relative_context_transition_fn.k = %SUBGOAL_DIM
relative_context_multi_transition_fn.k = %SUBGOAL_DIM
task/relative_context_transition_fn.k = 2
task/relative_context_multi_transition_fn.k = 2
MetaAgent.k = %SUBGOAL_DIM
eval2/ConstantSampler.value = [0, 19]