Commit f2120b07 authored by Olga Wichrowska

Added code for Learned Optimizers that Scale and Generalize

parent 6024579b
@@ -10,6 +10,7 @@ differential_privacy/* @panyx0718
domain_adaptation/* @bousmalis @ddohan
im2txt/* @cshallue
inception/* @shlens @vincentvanhoucke
learned_optimizer/* @olganw @nirum
learning_to_remember_rare_events/* @lukaszkaiser @ofirnachum
lfads/* @jazcollins @susillo
lm_1b/* @oriolvinyals @panyx0718
# Learning to Optimize Learning (LOL)
package(default_visibility = ["//visibility:public"])
# Libraries
# =========
py_library(
name = "metaopt",
srcs = ["metaopt.py"],
deps = [
"//learned_optimizer/problems:datasets",
"//learned_optimizer/problems:problem_generator",
],
)
# Binaries
# ========
py_binary(
name = "metarun",
srcs = ["metarun.py"],
deps = [
":metaopt",
"//learned_optimizer/optimizer:coordinatewise_rnn",
"//learned_optimizer/optimizer:global_learning_rate",
"//learned_optimizer/optimizer:hierarchical_rnn",
"//learned_optimizer/optimizer:learning_rate_schedule",
"//learned_optimizer/optimizer:trainable_adam",
"//learned_optimizer/problems:problem_sets",
"//learned_optimizer/problems:problem_spec",
],
)
# Learned Optimizer
Code for [Learned Optimizers that Scale and Generalize](https://arxiv.org/abs/1703.04813).
## Requirements
* Bazel ([install](https://bazel.build/versions/master/docs/install.html))
* TensorFlow >= v1.3
## Training a Learned Optimizer
## Code Overview
In the top-level directory, ```metaopt.py``` contains the code to train and test a learned optimizer. ```metarun.py``` packages the actual training procedure into a
single file, defining and exposing many flags to tune the procedure, from selecting the optimizer type and problem set down to fine-grained hyperparameter settings.
There is no testing binary; testing can be done ad hoc via ```metaopt.test_optimizer``` by passing in an optimizer object and a directory containing a checkpoint.
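The ad hoc testing flow can be sketched outside TensorFlow (all names below are illustrative, not the repo's API): evaluate an optimizer object on a fixed problem and check that the objective decreases.

```python
import numpy as np

class SGD(object):
    """Stand-in for a restored optimizer object (illustrative only)."""
    def __init__(self, lr):
        self.lr = lr

    def step(self, param, grad):
        return param - self.lr * grad

def evaluate_optimizer(opt, num_iters=100):
    # Quadratic test problem f(x) = 0.5 * ||x||^2, whose gradient is x.
    x = np.ones(5)
    losses = []
    for _ in range(num_iters):
        losses.append(0.5 * float(np.dot(x, x)))
        x = opt.step(x, x)
    return losses

losses = evaluate_optimizer(SGD(lr=0.1))
```

A real run would instead pass a restored ```HierarchicalRNN``` instance and a checkpoint directory to ```metaopt.test_optimizer```.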
The ```optimizer``` directory contains a base ```trainable_optimizer.py``` class and a number of extensions, including the ```hierarchical_rnn``` optimizer used in
the paper, a ```coordinatewise_rnn``` optimizer that more closely matches previous work, and a number of simpler optimizers to demonstrate the basic mechanics of
a learnable optimizer.
The ```problems``` directory contains the code to build the problems that were used in the meta-training set.
### Binaries
```metarun.py```: meta-training of a learned optimizer
### Command-Line Flags
The flags most relevant to meta-training are defined in ```metarun.py```. The default values will meta-train a HierarchicalRNN optimizer with the hyperparameter
settings used in the paper.
### Using a Learned Optimizer as a Black Box
The ```trainable_optimizer``` inherits from ```tf.train.Optimizer```, so a properly instantiated version can be used to train any model with any API that accepts
this class. There are two caveats:
1. If using the HierarchicalRNN optimizer, the ```apply_gradients``` return type must be changed (see the inline comments for exactly what must be removed).
2. Care must be taken to restore the optimizer's variables without overwriting them. Optimizer variables should be loaded manually using a pretrained checkpoint
and a ```tf.train.Saver``` with only the optimizer variables. Then, when constructing the session, ensure that any automatic variable initialization does not
re-initialize the loaded optimizer variables.
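The second caveat amounts to a name-filtered restore. A minimal stand-in for the idea, using plain dicts in place of checkpoints and ```tf.train.Saver``` (the variable names are hypothetical):

```python
# Pretend checkpoint and freshly constructed variables, keyed by name.
checkpoint = {"optimizer/rnn/update_weights": 0.7, "model/dense/kernel": 2.0}
variables = {"optimizer/rnn/update_weights": 0.0, "model/dense/kernel": 0.0}

# Restore only optimizer-scoped variables (what a Saver built over just the
# optimizer's variable list would do); leave model variables untouched.
restored = {
    name: (checkpoint[name] if name.startswith("optimizer/") else value)
    for name, value in variables.items()
}

# Any later blanket initialization must skip the names already restored.
already_restored = {n for n in restored if n.startswith("optimizer/")}
```

The session-construction step then initializes only variables outside `already_restored`.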
## Contact for Issues
* Olga Wichrowska (@olganw), Niru Maheswaranathan (@nirum)
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Scripts for meta-optimization."""
from __future__ import print_function
import os
import tensorflow as tf
import metaopt
from learned_optimizer.optimizer import coordinatewise_rnn
from learned_optimizer.optimizer import global_learning_rate
from learned_optimizer.optimizer import hierarchical_rnn
from learned_optimizer.optimizer import learning_rate_schedule
from learned_optimizer.optimizer import trainable_adam
from learned_optimizer.problems import problem_sets as ps
from learned_optimizer.problems import problem_spec
tf.app.flags.DEFINE_string("train_dir", "/tmp/lol/",
"""Directory to store parameters and results.""")
tf.app.flags.DEFINE_integer("task", 0,
"""Task id of the replica running the training.""")
tf.app.flags.DEFINE_integer("worker_tasks", 1,
"""Number of tasks in the worker job.""")
tf.app.flags.DEFINE_integer("num_problems", 1000,
"""Number of sub-problems to run.""")
tf.app.flags.DEFINE_integer("num_meta_iterations", 5,
"""Number of meta-iterations to optimize.""")
tf.app.flags.DEFINE_integer("num_unroll_scale", 40,
"""The scale parameter of the exponential
distribution from which the number of partial
unrolls is drawn.""")
tf.app.flags.DEFINE_integer("min_num_unrolls", 1,
"""The minimum number of unrolls per problem.""")
tf.app.flags.DEFINE_integer("num_partial_unroll_itr_scale", 200,
"""The scale parameter of the exponential
distribution from which the number of iterations
per unroll is drawn.""")
tf.app.flags.DEFINE_integer("min_num_itr_partial_unroll", 50,
"""The minimum number of iterations for one
unroll.""")
tf.app.flags.DEFINE_string("optimizer", "HierarchicalRNN",
"""Which meta-optimizer to train.""")
# CoordinatewiseRNN-specific flags
tf.app.flags.DEFINE_integer("cell_size", 20,
"""Size of the RNN hidden state in each layer.""")
tf.app.flags.DEFINE_integer("num_cells", 2,
"""Number of RNN layers.""")
tf.app.flags.DEFINE_string("cell_cls", "GRUCell",
"""Type of RNN cell to use.""")
# Metaoptimization parameters
tf.app.flags.DEFINE_float("meta_learning_rate", 1e-6,
"""The learning rate for the meta-optimizer.""")
tf.app.flags.DEFINE_float("gradient_clip_level", 1e4,
"""The level to clip gradients to.""")
# Training set selection
tf.app.flags.DEFINE_boolean("include_quadratic_problems", False,
"""Include non-noisy quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_quadratic_problems", True,
"""Include noisy quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_large_quadratic_problems", True,
"""Include very large quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_bowl_problems", True,
"""Include 2D bowl problems.""")
tf.app.flags.DEFINE_boolean("include_softmax_2_class_problems", True,
"""Include 2-class logistic regression problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_softmax_2_class_problems", True,
"""Include noisy 2-class logistic regression
problems.""")
tf.app.flags.DEFINE_boolean("include_optimization_test_problems", True,
"""Include non-noisy versions of classic
optimization test problems, e.g. Rosenbrock.""")
tf.app.flags.DEFINE_boolean("include_noisy_optimization_test_problems", True,
"""Include gradient-noise versions of classic
optimization test problems, e.g. Rosenbrock.""")
tf.app.flags.DEFINE_boolean("include_fully_connected_random_2_class_problems",
True, """Include MLP problems for 2 classes.""")
tf.app.flags.DEFINE_boolean("include_matmul_problems", True,
"""Include matrix multiplication problems.""")
tf.app.flags.DEFINE_boolean("include_log_objective_problems", True,
"""Include problems where the objective is the log
objective of another problem, e.g. Bowl.""")
tf.app.flags.DEFINE_boolean("include_rescale_problems", True,
"""Include problems where the parameters are scaled
version of the original parameters.""")
tf.app.flags.DEFINE_boolean("include_norm_problems", True,
"""Include problems where the objective is the
N-norm of another problem, e.g. Quadratic.""")
tf.app.flags.DEFINE_boolean("include_sum_problems", True,
"""Include problems where the objective is the sum
of the objectives of the subproblems that make
up the problem parameters. Per-problem tensors
are still independent of each other.""")
tf.app.flags.DEFINE_boolean("include_sparse_gradient_problems", True,
"""Include problems where the gradient is set to 0
with some high probability.""")
tf.app.flags.DEFINE_boolean("include_sparse_softmax_problems", False,
"""Include sparse softmax problems.""")
tf.app.flags.DEFINE_boolean("include_one_hot_sparse_softmax_problems", False,
"""Include one-hot sparse softmax problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_bowl_problems", True,
"""Include noisy bowl problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_norm_problems", True,
"""Include noisy norm problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_sum_problems", True,
"""Include noisy sum problems.""")
tf.app.flags.DEFINE_boolean("include_sum_of_quadratics_problems", False,
"""Include sum of quadratics problems.""")
tf.app.flags.DEFINE_boolean("include_projection_quadratic_problems", False,
"""Include projection quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_outward_snake_problems", False,
"""Include outward snake problems.""")
tf.app.flags.DEFINE_boolean("include_dependency_chain_problems", False,
"""Include dependency chain problems.""")
tf.app.flags.DEFINE_boolean("include_min_max_well_problems", False,
"""Include min-max well problems.""")
# Optimizer parameters: initialization and scale values
tf.app.flags.DEFINE_float("min_lr", 1e-6,
"""The minimum initial learning rate.""")
tf.app.flags.DEFINE_float("max_lr", 1e-2,
"""The maximum initial learning rate.""")
# Optimizer parameters: small features.
tf.app.flags.DEFINE_boolean("zero_init_lr_weights", True,
"""Whether to initialize the learning rate weights
to 0 rather than the scaled random initialization
used for other RNN variables.""")
tf.app.flags.DEFINE_boolean("use_relative_lr", True,
"""Whether to use the relative learning rate as an
input during training. Can only be used if
learnable_decay is also True.""")
tf.app.flags.DEFINE_boolean("use_extreme_indicator", False,
"""Whether to use the extreme indicator for learning
rates as an input during training. Can only be
used if learnable_decay is also True.""")
tf.app.flags.DEFINE_boolean("use_log_means_squared", True,
"""Whether to track the log of the mean squared
grads instead of the means squared grads.""")
tf.app.flags.DEFINE_boolean("use_problem_lr_mean", True,
"""Whether to use the mean over all learning rates
in the problem when calculating the relative
learning rate.""")
# Optimizer parameters: major features
tf.app.flags.DEFINE_boolean("learnable_decay", True,
"""Whether to learn weights that dynamically
modulate the input scale via RMS decay.""")
tf.app.flags.DEFINE_boolean("dynamic_output_scale", True,
"""Whether to learn weights that dynamically
modulate the output scale.""")
tf.app.flags.DEFINE_boolean("use_log_objective", True,
"""Whether to use the log of the scaled objective
rather than just the scaled obj for training.""")
tf.app.flags.DEFINE_boolean("use_attention", False,
"""Whether to learn where to attend.""")
tf.app.flags.DEFINE_boolean("use_second_derivatives", True,
"""Whether to use second derivatives.""")
tf.app.flags.DEFINE_integer("num_gradient_scales", 4,
"""How many different timescales to keep for
gradient history. If > 1, also learns a scale
factor for gradient history.""")
tf.app.flags.DEFINE_float("max_log_lr", 33,
"""The maximum log learning rate allowed.""")
tf.app.flags.DEFINE_float("objective_training_max_multiplier", -1,
"""How much the objective can grow before training on
this problem / param pair is terminated. Sets a max
on the objective value when multiplied by the
initial objective. If <= 0, not used.""")
tf.app.flags.DEFINE_boolean("use_gradient_shortcut", True,
"""Whether to add a learned affine projection of the
gradient to the update delta in addition to the
gradient function computed by the RNN.""")
tf.app.flags.DEFINE_boolean("use_lr_shortcut", False,
"""Whether to add the difference between the current
learning rate and the desired learning rate to
the RNN input.""")
tf.app.flags.DEFINE_boolean("use_grad_products", True,
"""Whether to use gradient products in the input to
the RNN. Only applicable when num_gradient_scales
> 1.""")
tf.app.flags.DEFINE_boolean("use_multiple_scale_decays", False,
"""Whether to use many-timescale scale decays.""")
tf.app.flags.DEFINE_boolean("use_numerator_epsilon", False,
"""Whether to use epsilon in the numerator of the
log objective.""")
tf.app.flags.DEFINE_boolean("learnable_inp_decay", True,
"""Whether to learn input decay weight and bias.""")
tf.app.flags.DEFINE_boolean("learnable_rnn_init", True,
"""Whether to learn RNN state initialization.""")
FLAGS = tf.app.flags.FLAGS
# The size of the RNN hidden state in each layer:
# [PerParam, PerTensor, Global]. The length of this list must be 1, 2, or 3.
# If less than 3, the Global and/or PerTensor RNNs will not be created.
HRNN_CELL_SIZES = [10, 20, 20]
def register_optimizers():
opts = {}
opts["CoordinatewiseRNN"] = coordinatewise_rnn.CoordinatewiseRNN
opts["GlobalLearningRate"] = global_learning_rate.GlobalLearningRate
opts["HierarchicalRNN"] = hierarchical_rnn.HierarchicalRNN
opts["LearningRateSchedule"] = learning_rate_schedule.LearningRateSchedule
opts["TrainableAdam"] = trainable_adam.TrainableAdam
return opts
def main(unused_argv):
"""Runs the main script."""
opts = register_optimizers()
# Choose a set of problems to optimize. By default this includes quadratics,
# 2-dimensional bowls, 2-class softmax problems, and non-noisy optimization
# test problems (e.g. Rosenbrock, Beale)
problems_and_data = []
if FLAGS.include_sparse_softmax_problems:
problems_and_data.extend(ps.sparse_softmax_2_class_sparse_problems())
if FLAGS.include_one_hot_sparse_softmax_problems:
problems_and_data.extend(
ps.one_hot_sparse_softmax_2_class_sparse_problems())
if FLAGS.include_quadratic_problems:
problems_and_data.extend(ps.quadratic_problems())
if FLAGS.include_noisy_quadratic_problems:
problems_and_data.extend(ps.quadratic_problems_noisy())
if FLAGS.include_large_quadratic_problems:
problems_and_data.extend(ps.quadratic_problems_large())
if FLAGS.include_bowl_problems:
problems_and_data.extend(ps.bowl_problems())
if FLAGS.include_noisy_bowl_problems:
problems_and_data.extend(ps.bowl_problems_noisy())
if FLAGS.include_softmax_2_class_problems:
problems_and_data.extend(ps.softmax_2_class_problems())
if FLAGS.include_noisy_softmax_2_class_problems:
problems_and_data.extend(ps.softmax_2_class_problems_noisy())
if FLAGS.include_optimization_test_problems:
problems_and_data.extend(ps.optimization_test_problems())
if FLAGS.include_noisy_optimization_test_problems:
problems_and_data.extend(ps.optimization_test_problems_noisy())
if FLAGS.include_fully_connected_random_2_class_problems:
problems_and_data.extend(ps.fully_connected_random_2_class_problems())
if FLAGS.include_matmul_problems:
problems_and_data.extend(ps.matmul_problems())
if FLAGS.include_log_objective_problems:
problems_and_data.extend(ps.log_objective_problems())
if FLAGS.include_rescale_problems:
problems_and_data.extend(ps.rescale_problems())
if FLAGS.include_norm_problems:
problems_and_data.extend(ps.norm_problems())
if FLAGS.include_noisy_norm_problems:
problems_and_data.extend(ps.norm_problems_noisy())
if FLAGS.include_sum_problems:
problems_and_data.extend(ps.sum_problems())
if FLAGS.include_noisy_sum_problems:
problems_and_data.extend(ps.sum_problems_noisy())
  if FLAGS.include_sparse_gradient_problems:
    problems_and_data.extend(ps.sparse_gradient_problems())
    if FLAGS.include_fully_connected_random_2_class_problems:
      problems_and_data.extend(ps.sparse_gradient_problems_mlp())
if FLAGS.include_min_max_well_problems:
problems_and_data.extend(ps.min_max_well_problems())
if FLAGS.include_sum_of_quadratics_problems:
problems_and_data.extend(ps.sum_of_quadratics_problems())
if FLAGS.include_projection_quadratic_problems:
problems_and_data.extend(ps.projection_quadratic_problems())
if FLAGS.include_outward_snake_problems:
problems_and_data.extend(ps.outward_snake_problems())
if FLAGS.include_dependency_chain_problems:
problems_and_data.extend(ps.dependency_chain_problems())
# log directory
logdir = os.path.join(FLAGS.train_dir,
"{}_{}_{}_{}".format(FLAGS.optimizer,
FLAGS.cell_cls,
FLAGS.cell_size,
FLAGS.num_cells))
# get the optimizer class and arguments
optimizer_cls = opts[FLAGS.optimizer]
assert len(HRNN_CELL_SIZES) in [1, 2, 3]
optimizer_args = (HRNN_CELL_SIZES,)
optimizer_kwargs = {
"init_lr_range": (FLAGS.min_lr, FLAGS.max_lr),
"learnable_decay": FLAGS.learnable_decay,
"dynamic_output_scale": FLAGS.dynamic_output_scale,
"cell_cls": getattr(tf.contrib.rnn, FLAGS.cell_cls),
"use_attention": FLAGS.use_attention,
"use_log_objective": FLAGS.use_log_objective,
"num_gradient_scales": FLAGS.num_gradient_scales,
"zero_init_lr_weights": FLAGS.zero_init_lr_weights,
"use_log_means_squared": FLAGS.use_log_means_squared,
"use_relative_lr": FLAGS.use_relative_lr,
"use_extreme_indicator": FLAGS.use_extreme_indicator,
"max_log_lr": FLAGS.max_log_lr,
"obj_train_max_multiplier": FLAGS.objective_training_max_multiplier,
"use_problem_lr_mean": FLAGS.use_problem_lr_mean,
"use_gradient_shortcut": FLAGS.use_gradient_shortcut,
"use_second_derivatives": FLAGS.use_second_derivatives,
"use_lr_shortcut": FLAGS.use_lr_shortcut,
"use_grad_products": FLAGS.use_grad_products,
"use_multiple_scale_decays": FLAGS.use_multiple_scale_decays,
"use_numerator_epsilon": FLAGS.use_numerator_epsilon,
"learnable_inp_decay": FLAGS.learnable_inp_decay,
"learnable_rnn_init": FLAGS.learnable_rnn_init,
}
optimizer_spec = problem_spec.Spec(
optimizer_cls, optimizer_args, optimizer_kwargs)
# make log directory
tf.gfile.MakeDirs(logdir)
is_chief = FLAGS.task == 0
# if this is a distributed run, make the chief run through problems in order
select_random_problems = FLAGS.worker_tasks == 1 or not is_chief
def num_unrolls():
return metaopt.sample_numiter(FLAGS.num_unroll_scale, FLAGS.min_num_unrolls)
def num_partial_unroll_itrs():
return metaopt.sample_numiter(FLAGS.num_partial_unroll_itr_scale,
FLAGS.min_num_itr_partial_unroll)
# run it
metaopt.train_optimizer(
logdir,
optimizer_spec,
problems_and_data,
FLAGS.num_problems,
FLAGS.num_meta_iterations,
num_unrolls,
num_partial_unroll_itrs,
learning_rate=FLAGS.meta_learning_rate,
gradient_clip=FLAGS.gradient_clip_level,
is_chief=is_chief,
select_random_problems=select_random_problems,
obj_train_max_multiplier=FLAGS.objective_training_max_multiplier,
callbacks=[])
return 0
if __name__ == "__main__":
tf.app.run()
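The two sampling helpers in ```main``` call ```metaopt.sample_numiter``` with a scale and a minimum. A plausible reading of that behavior (assumed here, not taken from ```metaopt.py```) is an exponential draw with an enforced floor:

```python
import numpy as np

def sample_numiter(scale, min_count):
    # Assumed behavior: draw a count from an exponential distribution with
    # the given scale, never returning fewer than min_count iterations.
    return max(int(np.random.exponential(scale)), min_count)

# Mirrors the defaults: num_unroll_scale=40, min_num_unrolls=1.
draws = [sample_numiter(40, 1) for _ in range(1000)]
```

Randomizing the unroll counts this way exposes the meta-optimizer to many effective training horizons instead of a single fixed one.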
package(default_visibility = ["//visibility:public"])
# Libraries
# =========
py_library(
name = "coordinatewise_rnn",
srcs = ["coordinatewise_rnn.py"],
deps = [
":trainable_optimizer",
":utils",
],
)
py_library(
name = "global_learning_rate",
srcs = ["global_learning_rate.py"],
deps = [
":trainable_optimizer",
],
)
py_library(
name = "hierarchical_rnn",
srcs = ["hierarchical_rnn.py"],
deps = [
":rnn_cells",
":trainable_optimizer",
":utils",
],
)
py_library(
name = "learning_rate_schedule",
srcs = ["learning_rate_schedule.py"],
deps = [
":trainable_optimizer",
],
)
py_library(
name = "rnn_cells",
srcs = ["rnn_cells.py"],
deps = [
":utils",
],
)
py_library(
name = "trainable_adam",
srcs = ["trainable_adam.py"],
deps = [
":trainable_optimizer",
":utils",
],
)
py_library(
name = "trainable_optimizer",
srcs = ["trainable_optimizer.py"],
deps = [
],
)
py_library(
name = "utils",
srcs = ["utils.py"],
deps = [
],
)
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Collection of trainable optimizers for meta-optimization."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import numpy as np
import tensorflow as tf
from learned_optimizer.optimizer import utils
from learned_optimizer.optimizer import trainable_optimizer as opt
# Default was 1e-3
tf.app.flags.DEFINE_float("crnn_rnn_readout_scale", 0.5,
"""The initialization scale for the RNN readouts.""")
tf.app.flags.DEFINE_float("crnn_default_decay_var_init", 2.2,
"""The default initializer value for any decay/
momentum style variables and constants.
sigmoid(2.2) ~ 0.9, sigmoid(-2.2) ~ 0.1.""")
FLAGS = tf.app.flags.FLAGS
class CoordinatewiseRNN(opt.TrainableOptimizer):
"""RNN that operates on each coordinate of the problem independently."""
def __init__(self,
cell_sizes,
cell_cls,
init_lr_range=(1., 1.),
dynamic_output_scale=True,
learnable_decay=True,
zero_init_lr_weights=False,
**kwargs):
"""Initializes the RNN per-parameter optimizer.
Args:
cell_sizes: List of hidden state sizes for each RNN cell in the network
cell_cls: tf.contrib.rnn class for specifying the RNN cell type
init_lr_range: the range in which to initialize the learning rates.
dynamic_output_scale: whether to learn weights that dynamically modulate
the output scale (default: True)
learnable_decay: whether to learn weights that dynamically modulate the
input scale via RMS style decay (default: True)
zero_init_lr_weights: whether to initialize the lr weights to zero
**kwargs: args passed to TrainableOptimizer's constructor
Raises:
ValueError: If the init lr range is not of length 2.
ValueError: If the init lr range is not a valid range (min > max).
"""
if len(init_lr_range) != 2:
raise ValueError(
"Initial LR range must be len 2, was {}".format(len(init_lr_range)))
if init_lr_range[0] > init_lr_range[1]:
raise ValueError("Initial LR range min is greater than max.")
self.init_lr_range = init_lr_range
self.zero_init_lr_weights = zero_init_lr_weights
self.reuse_vars = False
# create the RNN cell
with tf.variable_scope(opt.OPTIMIZER_SCOPE):
self.component_cells = [cell_cls(sz) for sz in cell_sizes]
self.cell = tf.contrib.rnn.MultiRNNCell(self.component_cells)
# random normal initialization scaled by the output size
scale_factor = FLAGS.crnn_rnn_readout_scale / math.sqrt(cell_sizes[-1])
scaled_init = tf.random_normal_initializer(0., scale_factor)
# weights for projecting the hidden state to a parameter update
self.update_weights = tf.get_variable("update_weights",
shape=(cell_sizes[-1], 1),
initializer=scaled_init)
self._initialize_decay(learnable_decay, (cell_sizes[-1], 1), scaled_init)
self._initialize_lr(dynamic_output_scale, (cell_sizes[-1], 1),
scaled_init)
state_size = sum([sum(state_size) for state_size in self.cell.state_size])
self._init_vector = tf.get_variable(
"init_vector", shape=[1, state_size],
initializer=tf.random_uniform_initializer(-1., 1.))
state_keys = ["rms", "rnn", "learning_rate", "decay"]
super(CoordinatewiseRNN, self).__init__("cRNN", state_keys, **kwargs)
def _initialize_decay(
self, learnable_decay, weights_tensor_shape, scaled_init):
"""Initializes the decay weights and bias variables or tensors.
Args:
learnable_decay: Whether to use learnable decay.
weights_tensor_shape: The shape the weight tensor should take.
scaled_init: The scaled initialization for the weights tensor.
"""
if learnable_decay:
# weights for projecting the hidden state to the RMS decay term
self.decay_weights = tf.get_variable("decay_weights",
shape=weights_tensor_shape,
initializer=scaled_init)
self.decay_bias = tf.get_variable(
"decay_bias", shape=(1,),
initializer=tf.constant_initializer(
FLAGS.crnn_default_decay_var_init))
else:
self.decay_weights = tf.zeros_like(self.update_weights)
self.decay_bias = tf.constant(FLAGS.crnn_default_decay_var_init)
def _initialize_lr(
self, dynamic_output_scale, weights_tensor_shape, scaled_init):
"""Initializes the learning rate weights and bias variables or tensors.
Args:
dynamic_output_scale: Whether to use a dynamic output scale.
weights_tensor_shape: The shape the weight tensor should take.
scaled_init: The scaled initialization for the weights tensor.
"""
if dynamic_output_scale:
zero_init = tf.constant_initializer(0.)
wt_init = zero_init if self.zero_init_lr_weights else scaled_init
self.lr_weights = tf.get_variable("learning_rate_weights",
shape=weights_tensor_shape,
initializer=wt_init)
self.lr_bias = tf.get_variable("learning_rate_bias", shape=(1,),
initializer=zero_init)
else:
self.lr_weights = tf.zeros_like(self.update_weights)
self.lr_bias = tf.zeros([1, 1])
def _initialize_state(self, var):
"""Return a dictionary mapping names of state variables to their values."""
vectorized_shape = [var.get_shape().num_elements(), 1]
min_lr = self.init_lr_range[0]
max_lr = self.init_lr_range[1]
if min_lr == max_lr:
init_lr = tf.constant(min_lr, shape=vectorized_shape)
else:
actual_vals = tf.random_uniform(vectorized_shape,
np.log(min_lr),
np.log(max_lr))
init_lr = tf.exp(actual_vals)
ones = tf.ones(vectorized_shape)
rnn_init = ones * self._init_vector
return {
"rms": tf.ones(vectorized_shape),
"learning_rate": init_lr,
"rnn": rnn_init,
"decay": tf.ones(vectorized_shape),
}
def _compute_update(self, param, grad, state):
"""Update parameters given the gradient and state.
Args:
param: tensor of parameters
grad: tensor of gradients with the same shape as param
state: a dictionary containing any state for the optimizer
Returns:
updated_param: updated parameters
updated_state: updated state variables in a dictionary
"""
with tf.variable_scope(opt.OPTIMIZER_SCOPE) as scope:
if self.reuse_vars:
scope.reuse_variables()
else:
self.reuse_vars = True
param_shape = tf.shape(param)
(grad_values, decay_state, rms_state, rnn_state, learning_rate_state,
grad_indices) = self._extract_gradients_and_internal_state(
grad, state, param_shape)
# Vectorize and scale the gradients.
grad_scaled, rms = utils.rms_scaling(grad_values, decay_state, rms_state)
# Apply the RNN update.
rnn_state_tuples = self._unpack_rnn_state_into_tuples(rnn_state)
rnn_output, rnn_state_tuples = self.cell(grad_scaled, rnn_state_tuples)
rnn_state = self._pack_tuples_into_rnn_state(rnn_state_tuples)
# Compute the update direction (a linear projection of the RNN output).
delta = utils.project(rnn_output, self.update_weights)
# The updated decay is an affine projection of the hidden state
decay = utils.project(rnn_output, self.decay_weights,
bias=self.decay_bias, activation=tf.nn.sigmoid)
# Compute the change in learning rate (an affine projection of the RNN
# state, passed through a 2x sigmoid, so the change is bounded).
learning_rate_change = 2. * utils.project(rnn_output, self.lr_weights,
bias=self.lr_bias,
activation=tf.nn.sigmoid)
# Update the learning rate.
new_learning_rate = learning_rate_change * learning_rate_state
# Apply the update to the parameters.
update = tf.reshape(new_learning_rate * delta, tf.shape(grad_values))
if isinstance(grad, tf.IndexedSlices):
update = utils.stack_tensor(update, grad_indices, param,
param_shape[:1])
rms = utils.update_slices(rms, grad_indices, state["rms"], param_shape)
new_learning_rate = utils.update_slices(new_learning_rate, grad_indices,
state["learning_rate"],
param_shape)
rnn_state = utils.update_slices(rnn_state, grad_indices, state["rnn"],
param_shape)
decay = utils.update_slices(decay, grad_indices, state["decay"],
param_shape)
new_param = param - update
# Collect the update and new state.
new_state = {
"rms": rms,
"learning_rate": new_learning_rate,
"rnn": rnn_state,
"decay": decay,
}
return new_param, new_state
def _extract_gradients_and_internal_state(self, grad, state, param_shape):
"""Extracts the gradients and relevant internal state.
If the gradient is sparse, extracts the appropriate slices from the state.
Args:
grad: The current gradient.
state: The current state.
param_shape: The shape of the parameter (used if gradient is sparse).
Returns:
grad_values: The gradient value tensor.
decay_state: The current decay state.
rms_state: The current rms state.
rnn_state: The current state of the internal rnns.
learning_rate_state: The current learning rate state.
grad_indices: The indices for the gradient tensor, if sparse.
None otherwise.
"""
if isinstance(grad, tf.IndexedSlices):
grad_indices, grad_values = utils.accumulate_sparse_gradients(grad)
decay_state = utils.slice_tensor(state["decay"], grad_indices,
param_shape)
rms_state = utils.slice_tensor(state["rms"], grad_indices, param_shape)
rnn_state = utils.slice_tensor(state["rnn"], grad_indices, param_shape)
learning_rate_state = utils.slice_tensor(state["learning_rate"],
grad_indices, param_shape)
decay_state.set_shape([None, 1])
rms_state.set_shape([None, 1])
else:
grad_values = grad
grad_indices = None
decay_state = state["decay"]
rms_state = state["rms"]
rnn_state = state["rnn"]
learning_rate_state = state["learning_rate"]
return (grad_values, decay_state, rms_state, rnn_state, learning_rate_state,
grad_indices)
def _unpack_rnn_state_into_tuples(self, rnn_state):
"""Creates state tuples from the rnn state vector."""
rnn_state_tuples = []
cur_state_pos = 0
for cell in self.component_cells:
total_state_size = sum(cell.state_size)
cur_state = tf.slice(rnn_state, [0, cur_state_pos],
[-1, total_state_size])
cur_state_tuple = tf.split(value=cur_state, num_or_size_splits=2,
axis=1)
rnn_state_tuples.append(cur_state_tuple)
cur_state_pos += total_state_size
return rnn_state_tuples
def _pack_tuples_into_rnn_state(self, rnn_state_tuples):
"""Creates a single state vector concatenated along column axis."""
rnn_state = None
for new_state_tuple in rnn_state_tuples:
new_c, new_h = new_state_tuple
if rnn_state is None:
rnn_state = tf.concat([new_c, new_h], axis=1)
else:
rnn_state = tf.concat([rnn_state, tf.concat([new_c, new_h], 1)], axis=1)
return rnn_state
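The learning-rate mechanics in ```_compute_update``` are easy to check in isolation: the change factor is ```2 * sigmoid(projection)```, so each step multiplies the per-coordinate learning rate by a factor bounded in (0, 2), and a zero projection leaves the rate unchanged. A standalone NumPy sketch of just that piece:

```python
import numpy as np

def lr_change(projection):
    # 2 * sigmoid(projection): bounded in (0, 2), equal to 1 at zero.
    return 2.0 / (1.0 + np.exp(-projection))

def update_lr(lr, projection):
    # Multiplicative update, as in: new_learning_rate =
    # learning_rate_change * learning_rate_state.
    return lr_change(projection) * lr
```

Because the factor is multiplicative and bounded, the learned controller can at most double or (asymptotically) zero out a learning rate per step, which keeps the adaptation stable.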
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A trainable optimizer that learns a single global learning rate."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from learned_optimizer.optimizer import trainable_optimizer
class GlobalLearningRate(trainable_optimizer.TrainableOptimizer):
"""Optimizes for a single global learning rate."""
def __init__(self, initial_rate=1e-3, **kwargs):
"""Initializes the global learning rate."""
with tf.variable_scope(trainable_optimizer.OPTIMIZER_SCOPE):
initializer = tf.constant_initializer(initial_rate)
self.learning_rate = tf.get_variable("global_learning_rate", shape=(),
initializer=initializer)
super(GlobalLearningRate, self).__init__("GLR", [], **kwargs)
def _compute_update(self, param, grad, state):
return param - tf.scalar_mul(self.learning_rate, grad), state
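Stripped of the TensorFlow plumbing, ```GlobalLearningRate``` is plain SGD whose scalar rate is itself trained by the meta-optimizer; the per-step update is just:

```python
def glr_step(param, grad, learning_rate):
    # Same update as _compute_update above: param - lr * grad, with
    # learning_rate a single learned scalar shared by all parameters.
    return param - learning_rate * grad
```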
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A trainable optimizer that learns a learning rate schedule."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from learned_optimizer.optimizer import trainable_optimizer
class LearningRateSchedule(trainable_optimizer.TrainableOptimizer):
"""Learns a learning rate schedule over a fixed number of iterations."""
def __init__(self, initial_rate=0.0, n_steps=1000, **kwargs):
"""Initializes the learning rates."""
self.max_index = tf.constant(n_steps-1, dtype=tf.int32)
with tf.variable_scope(trainable_optimizer.OPTIMIZER_SCOPE):
initializer = tf.constant_initializer(initial_rate)
self.learning_rates = tf.get_variable("learning_rates",
shape=([n_steps,]),
initializer=initializer)
super(LearningRateSchedule, self).__init__("LRS", ["itr"], **kwargs)
def _initialize_state(self, var):
"""Return a dictionary mapping names of state variables to their values."""
return {
"itr": tf.constant(0, dtype=tf.int32),
}
def _compute_update(self, param, grad, state):
"""Compute updates of parameters."""
    # Get the learning rate at the current index; if the index is greater
    # than the number of available learning rates, use the last one.
index = tf.minimum(state["itr"], self.max_index)
learning_rate = tf.gather(self.learning_rates, index)
# update the parameters: parameter - learning_rate * gradient
updated_param = param - tf.scalar_mul(learning_rate, grad)
return updated_param, {"itr": state["itr"] + 1}
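The schedule lookup clamps the iteration counter, so training runs longer than `n_steps` simply reuse the last learned rate. A sketch with an illustrative three-step schedule:

```python
import numpy as np

learning_rates = np.array([0.1, 0.05, 0.01])  # illustrative learned schedule
max_index = len(learning_rates) - 1

# Past the end of the schedule, min(itr, max_index) pins to the last rate.
rates_used = [float(learning_rates[min(itr, max_index)]) for itr in [0, 2, 5]]
print(rates_used)  # [0.1, 0.01, 0.01]
```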
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Custom RNN cells for hierarchical RNNs."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from learned_optimizer.optimizer import utils
class BiasGRUCell(tf.contrib.rnn.RNNCell):
"""GRU cell (cf. http://arxiv.org/abs/1406.1078) with an additional bias."""
def __init__(self, num_units, activation=tf.tanh, scale=0.1,
gate_bias_init=0., random_seed=None):
self._num_units = num_units
self._activation = activation
self._scale = scale
self._gate_bias_init = gate_bias_init
self._random_seed = random_seed
@property
def state_size(self):
return self._num_units
@property
def output_size(self):
return self._num_units
def __call__(self, inputs, state, bias=None):
# Split the injected bias vector into a bias for the r, u, and c updates.
if bias is None:
bias = tf.zeros((1, 3))
r_bias, u_bias, c_bias = tf.split(bias, 3, 1)
with tf.variable_scope(type(self).__name__): # "BiasGRUCell"
with tf.variable_scope("gates"): # Reset gate and update gate.
proj = utils.affine([inputs, state], 2 * self._num_units,
scale=self._scale, bias_init=self._gate_bias_init,
random_seed=self._random_seed)
r_lin, u_lin = tf.split(proj, 2, 1)
r, u = tf.nn.sigmoid(r_lin + r_bias), tf.nn.sigmoid(u_lin + u_bias)
with tf.variable_scope("candidate"):
proj = utils.affine([inputs, r * state], self._num_units,
scale=self._scale, random_seed=self._random_seed)
c = self._activation(proj + c_bias)
new_h = u * state + (1 - u) * c
return new_h, new_h
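The cell's equations, written out as a toy NumPy step (random weights, illustrative sizes; the real cell builds its projections with `utils.affine`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# BiasGRUCell equations with an injected per-gate bias (here all zero,
# matching the bias=None case):
#   r = sigmoid(W_r [x, h] + r_bias), u = sigmoid(W_u [x, h] + u_bias)
#   c = tanh(W_c [x, r * h] + c_bias), h' = u * h + (1 - u) * c
rng = np.random.RandomState(0)
num_units, in_dim = 3, 2
x, h = rng.randn(1, in_dim), rng.randn(1, num_units)
W_r = rng.randn(in_dim + num_units, num_units) * 0.1
W_u = rng.randn(in_dim + num_units, num_units) * 0.1
W_c = rng.randn(in_dim + num_units, num_units) * 0.1
r_bias = u_bias = c_bias = 0.0

r = sigmoid(np.concatenate([x, h], 1) @ W_r + r_bias)
u = sigmoid(np.concatenate([x, h], 1) @ W_u + u_bias)
c = np.tanh(np.concatenate([x, r * h], 1) @ W_c + c_bias)
new_h = u * h + (1 - u) * c
print(new_h.shape)  # (1, 3)
```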
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A trainable ADAM optimizer that learns its internal variables."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
from learned_optimizer.optimizer import trainable_optimizer as opt
from learned_optimizer.optimizer import utils
class TrainableAdam(opt.TrainableOptimizer):
"""Adam optimizer with learnable scalar parameters.
  See Kingma et al., 2014 for the algorithm (http://arxiv.org/abs/1412.6980).
"""
def __init__(self,
learning_rate=1e-3,
beta1=0.9,
beta2=0.999,
epsilon=1e-8,
**kwargs):
"""Initializes the TrainableAdam optimizer with the given initial values.
Args:
learning_rate: The learning rate (default: 1e-3).
beta1: The exponential decay rate for the 1st moment estimates.
beta2: The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability.
**kwargs: Any additional keyword arguments for TrainableOptimizer.
Raises:
ValueError: if the learning rate or epsilon is not positive
ValueError: if beta1 or beta2 is not in (0, 1).
"""
if learning_rate <= 0:
raise ValueError("Learning rate must be positive.")
if epsilon <= 0:
raise ValueError("Epsilon must be positive.")
if not 0 < beta1 < 1 or not 0 < beta2 < 1:
raise ValueError("Beta values must be between 0 and 1, exclusive.")
self._reuse_vars = False
with tf.variable_scope(opt.OPTIMIZER_SCOPE):
def inv_sigmoid(x):
return np.log(x / (1.0 - x))
self.log_learning_rate = tf.get_variable(
"log_learning_rate",
shape=[],
initializer=tf.constant_initializer(np.log(learning_rate)))
self.beta1_logit = tf.get_variable(
"beta1_logit",
shape=[],
initializer=tf.constant_initializer(inv_sigmoid(beta1)))
self.beta2_logit = tf.get_variable(
"beta2_logit",
shape=[],
initializer=tf.constant_initializer(inv_sigmoid(beta2)))
self.log_epsilon = tf.get_variable(
"log_epsilon",
shape=[],
initializer=tf.constant_initializer(np.log(epsilon)))
# Key names are derived from Algorithm 1 described in
# https://arxiv.org/pdf/1412.6980.pdf
state_keys = ["m", "v", "t"]
super(TrainableAdam, self).__init__("Adam", state_keys, **kwargs)
def _initialize_state(self, var):
"""Returns a dictionary mapping names of state variables to their values."""
vectorized_shape = var.get_shape().num_elements(), 1
return {key: tf.zeros(vectorized_shape) for key in self.state_keys}
def _compute_update(self, param, grad, state):
"""Calculates the new internal state and parameters.
If the gradient is sparse, updates the appropriate slices in the internal
state and stacks the update tensor.
Args:
param: A tensor of parameters.
grad: A tensor of gradients with the same shape as param.
state: A dictionary containing any state for the optimizer.
Returns:
updated_param: The updated parameters.
updated_state: The updated state variables in a dictionary.
"""
with tf.variable_scope(opt.OPTIMIZER_SCOPE) as scope:
if self._reuse_vars:
scope.reuse_variables()
else:
self._reuse_vars = True
(grad_values, first_moment, second_moment, timestep, grad_indices
) = self._extract_gradients_and_internal_state(
grad, state, tf.shape(param))
beta1 = tf.nn.sigmoid(self.beta1_logit)
beta2 = tf.nn.sigmoid(self.beta2_logit)
epsilon = tf.exp(self.log_epsilon) + 1e-10
learning_rate = tf.exp(self.log_learning_rate)
old_grad_shape = tf.shape(grad_values)
grad_values = tf.reshape(grad_values, [-1, 1])
new_timestep = timestep + 1
new_first_moment = self._update_adam_estimate(
first_moment, grad_values, beta1)
      new_second_moment = self._update_adam_estimate(
          second_moment, tf.square(grad_values), beta2)
debiased_first_moment = self._debias_adam_estimate(
new_first_moment, beta1, new_timestep)
debiased_second_moment = self._debias_adam_estimate(
new_second_moment, beta2, new_timestep)
# Propagating through the square root of 0 is very bad for stability.
update = (learning_rate * debiased_first_moment /
(tf.sqrt(debiased_second_moment + 1e-10) + epsilon))
update = tf.reshape(update, old_grad_shape)
if grad_indices is not None:
param_shape = tf.shape(param)
update = utils.stack_tensor(
update, grad_indices, param, param_shape[:1])
new_first_moment = utils.update_slices(
new_first_moment, grad_indices, state["m"], param_shape)
new_second_moment = utils.update_slices(
new_second_moment, grad_indices, state["v"], param_shape)
new_timestep = utils.update_slices(
new_timestep, grad_indices, state["t"], param_shape)
new_param = param - update
# collect the update and new state
new_state = {
"m": new_first_moment,
"v": new_second_moment,
"t": new_timestep
}
return new_param, new_state
def _update_adam_estimate(self, estimate, value, beta):
"""Returns a beta-weighted average of estimate and value."""
return (beta * estimate) + ((1 - beta) * value)
def _debias_adam_estimate(self, estimate, beta, t_step):
"""Returns a debiased estimate based on beta and the timestep."""
return estimate / (1 - tf.pow(beta, t_step))
def _extract_gradients_and_internal_state(self, grad, state, param_shape):
"""Extracts the gradients and relevant internal state.
If the gradient is sparse, extracts the appropriate slices from the state.
Args:
grad: The current gradient.
state: The current state.
param_shape: The shape of the parameter (used if gradient is sparse).
Returns:
grad_values: The gradient value tensor.
first_moment: The first moment tensor (internal state).
second_moment: The second moment tensor (internal state).
timestep: The current timestep (internal state).
grad_indices: The indices for the gradient tensor, if sparse.
None otherwise.
"""
grad_values = grad
grad_indices = None
first_moment = state["m"]
second_moment = state["v"]
timestep = state["t"]
if isinstance(grad, tf.IndexedSlices):
grad_indices, grad_values = utils.accumulate_sparse_gradients(grad)
first_moment = utils.slice_tensor(
first_moment, grad_indices, param_shape)
second_moment = utils.slice_tensor(
second_moment, grad_indices, param_shape)
timestep = utils.slice_tensor(timestep, grad_indices, param_shape)
return grad_values, first_moment, second_moment, timestep, grad_indices
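One dense Adam step in NumPy, mirroring the update form used above (including the `sqrt(v_hat + 1e-10) + epsilon` denominator); the parameter values are the defaults, and the gradient is illustrative:

```python
import numpy as np

learning_rate, beta1, beta2, epsilon = 1e-3, 0.9, 0.999, 1e-8
grad = np.array([0.1, -0.2])
m, v, t = np.zeros(2), np.zeros(2), 1

# Moment updates and debiasing, as in Kingma & Ba, Algorithm 1.
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
update = learning_rate * m_hat / (np.sqrt(v_hat + 1e-10) + epsilon)

print(np.sign(update))  # [ 1. -1.]: the step follows the gradient sign
```

On the first step the debiased moments reduce to `grad` and `grad ** 2`, so each coordinate moves by roughly the learning rate.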
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Utilities and helper functions."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
def make_finite(t, replacement):
"""Replaces non-finite tensor values with the replacement value."""
return tf.where(tf.is_finite(t), t, replacement)
def asinh(x):
"""Computes the inverse hyperbolic sine function (in tensorflow)."""
return tf.log(x + tf.sqrt(1. + x ** 2))
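The `asinh` helper relies on the identity `arcsinh(x) = log(x + sqrt(1 + x^2))`; a quick NumPy check against the reference implementation:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 11)
ours = np.log(x + np.sqrt(1.0 + x ** 2))
print(np.allclose(ours, np.arcsinh(x)))  # True
```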
def affine(inputs, output_size, scope="Affine", scale=0.1, vec_mean=0.,
include_bias=True, bias_init=0., random_seed=None):
"""Computes an affine function of the inputs.
Creates or recalls tensorflow variables "Matrix" and "Bias"
to generate an affine operation on the input.
If the inputs are a list of tensors, they are concatenated together.
Initial weights for the matrix are drawn from a Gaussian with zero
mean and standard deviation that is the given scale divided by the
square root of the input dimension. Initial weights for the bias are
set to zero.
Args:
inputs: List of tensors with shape (batch_size, input_size)
output_size: Size (dimension) of the output
scope: Variable scope for these parameters (default: "Affine")
scale: Initial weight scale for the matrix parameters (default: 0.1),
this constant is divided by the sqrt of the input size to get the
std. deviation of the initial weights
vec_mean: The mean for the random initializer
include_bias: Whether to include the bias term
bias_init: The initializer bias (default 0.)
random_seed: Random seed for random initializers. (Default: None)
Returns:
output: Tensor with shape (batch_size, output_size)
"""
# Concatenate the input arguments.
x = tf.concat(inputs, 1)
with tf.variable_scope(scope):
input_size = x.get_shape().as_list()[1]
sigma = scale / np.sqrt(input_size)
rand_init = tf.random_normal_initializer(mean=vec_mean, stddev=sigma,
seed=random_seed)
matrix = tf.get_variable("Matrix", [input_size, output_size],
dtype=tf.float32, initializer=rand_init)
if include_bias:
bias = tf.get_variable("Bias", [output_size], dtype=tf.float32,
initializer=tf.constant_initializer(bias_init,
tf.float32))
else:
bias = 0.
output = tf.matmul(x, matrix) + bias
return output
def project(inputs, weights, bias=0., activation=tf.identity):
"""Computes an affine or linear projection of the inputs.
Projects the inputs onto the given weight vector and (optionally)
adds a bias and passes the result through an activation function.
Args:
inputs: matrix of inputs with shape [batch_size, dim]
weights: weight matrix with shape [dim, output_dim]
bias: bias vector with shape [output_dim] (default: 0)
activation: nonlinear activation function (default: tf.identity)
Returns:
outputs: an op which computes activation(inputs @ weights + bias)
"""
return activation(tf.matmul(inputs, weights) + bias)
def new_mean_squared(grad_vec, decay, ms):
"""Calculates the new accumulated mean squared of the gradient.
Args:
grad_vec: the vector for the current gradient
decay: the decay term
ms: the previous mean_squared value
Returns:
the new mean_squared value
"""
decay_size = decay.get_shape().num_elements()
decay_check_ops = [
tf.assert_less_equal(decay, 1., summarize=decay_size),
tf.assert_greater_equal(decay, 0., summarize=decay_size)]
with tf.control_dependencies(decay_check_ops):
grad_squared = tf.square(grad_vec)
# If the previous mean_squared is the 0 vector, don't use the decay and just
# return the full grad_squared. This should only happen on the first timestep.
decay = tf.cond(tf.reduce_all(tf.equal(ms, 0.)),
lambda: tf.zeros_like(decay, dtype=tf.float32), lambda: decay)
# Update the running average of squared gradients.
epsilon = 1e-12
return (1. - decay) * (grad_squared + epsilon) + decay * ms
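A NumPy sketch of the same running-average logic, including the first-step special case where a zero `ms` skips the decay (the eager `if` here stands in for the graph-mode `tf.cond`):

```python
import numpy as np

def new_ms(grad, decay, ms, eps=1e-12):
    if np.all(ms == 0.0):
        decay = 0.0                      # first step: take the full grad^2
    return (1.0 - decay) * (grad ** 2 + eps) + decay * ms

grad = np.array([1.0, 2.0])
ms = new_ms(grad, 0.9, np.zeros(2))      # ~= grad^2 on the first step
ms2 = new_ms(grad, 0.9, ms)              # later steps blend with decay 0.9
print(ms, ms2)
```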
def rms_scaling(gradient, decay, ms, update_ms=True):
"""Vectorizes and scales a tensor of gradients.
Args:
gradient: the current gradient
decay: the current decay value.
ms: the previous mean squared value
update_ms: Whether to update the mean squared value (default: True)
Returns:
The scaled gradient and the new ms value if update_ms is True,
the old ms value otherwise.
"""
# Vectorize the gradients and compute the squared gradients.
grad_vec = tf.reshape(gradient, [-1, 1])
if update_ms:
ms = new_mean_squared(grad_vec, decay, ms)
# Scale the current gradients by the RMS, squashed by the asinh function.
scaled_gradient = asinh(grad_vec / tf.sqrt(ms + 1e-16))
return scaled_gradient, ms
def accumulate_sparse_gradients(grad):
"""Accumulates repeated indices of a sparse gradient update.
Args:
grad: a tf.IndexedSlices gradient
Returns:
grad_indices: unique indices
grad_values: gradient values corresponding to the indices
"""
grad_indices, grad_segments = tf.unique(grad.indices)
grad_values = tf.unsorted_segment_sum(grad.values, grad_segments,
tf.shape(grad_indices)[0])
return grad_indices, grad_values
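The accumulation step, sketched in NumPy: values sharing an index are summed, mirroring `tf.unique` plus `tf.unsorted_segment_sum` (note `np.unique` sorts the indices, whereas `tf.unique` preserves encounter order):

```python
import numpy as np

indices = np.array([2, 0, 2])
values = np.array([1.0, 5.0, 3.0])

# Sum values that share an index.
unique, segments = np.unique(indices, return_inverse=True)
summed = np.zeros(len(unique))
np.add.at(summed, segments, values)
print(unique, summed)  # [0 2] [5. 4.]
```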
def slice_tensor(dense_tensor, indices, head_dims):
"""Extracts slices from a partially flattened dense tensor.
indices is assumed to index into the first dimension of head_dims.
dense_tensor is assumed to have a shape [D_0, D_1, ...] such that
prod(head_dims) == D_0. This function will extract slices along the
first_dimension of head_dims.
Example:
Consider a tensor with shape head_dims = [100, 2] and a dense_tensor with
shape [200, 3]. Note that the first dimension of dense_tensor equals the
product of head_dims. This function will reshape dense_tensor such that
its shape is now [100, 2, 3] (i.e. the first dimension became head-dims)
and then slice it along the first dimension. After slicing, the slices will
have their initial dimensions flattened just as they were in dense_tensor
(e.g. if there are 4 indices, the return value will have a shape of [4, 3]).
Args:
dense_tensor: a N-D dense tensor. Shape: [D_0, D_1, ...]
indices: a 1-D integer tensor. Shape: [K]
head_dims: True dimensions of the dense_tensor's first dimension.
Returns:
Extracted slices. Shape [K, D_1, ...]
"""
tail_dims = tf.shape(dense_tensor)[1:]
dense_tensor = tf.reshape(dense_tensor,
tf.concat([head_dims, tail_dims], 0))
slices = tf.gather(dense_tensor, indices)
# NOTE(siege): This kills the shape annotation.
return tf.reshape(slices, tf.concat([[-1], tail_dims], 0))
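The docstring example above, reproduced in NumPy: a `[200, 3]` tensor with `head_dims = [100, 2]` is viewed as `[100, 2, 3]`, gathered along the first dimension, then re-flattened:

```python
import numpy as np

dense = np.arange(200 * 3, dtype=np.float32).reshape(200, 3)
head_dims = (100, 2)
indices = np.array([0, 7, 42, 99])

viewed = dense.reshape(head_dims + (3,))   # [100, 2, 3]
slices = viewed[indices]                   # [4, 2, 3], like tf.gather
flat = slices.reshape(-1, 3)               # each index contributes 2 rows
print(flat.shape)  # (8, 3)
```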
def stack_tensor(slices, indices, dense_tensor, head_dims):
  """Reconstitutes a tensor from slices and corresponding indices.
This is an inverse operation to slice_tensor. Missing slices are set to 0.
Args:
slices: a tensor. Shape [K, D_1, ...]
indices: a 1-D integer tensor. Shape: [K]
dense_tensor: the original tensor the slices were taken
from. Shape: [D_0, D_1, ...]
head_dims: True dimensions of the dense_tensor's first dimension.
Returns:
    Reconstituted tensor. Shape: [D_0, D_1, ...]
"""
# NOTE(siege): This cast shouldn't be necessary.
indices = tf.cast(indices, tf.int32)
tail_dims = tf.shape(dense_tensor)[1:]
dense_shape = tf.concat([head_dims, tail_dims], 0)
slices = tf.reshape(slices, tf.concat([[-1], dense_shape[1:]], 0))
indices = tf.expand_dims(indices, -1)
return tf.reshape(tf.scatter_nd(indices, slices, dense_shape),
tf.shape(dense_tensor))
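For the simple case where `head_dims` matches the tensor's first dimension, `stack_tensor` behaves like a scatter into zeros; a NumPy sketch:

```python
import numpy as np

dense = np.ones((4, 2))        # only its shape matters for the result
slices = np.array([[5.0, 5.0], [7.0, 7.0]])
indices = np.array([1, 3])

# Scatter the slices to their indices; missing rows stay 0.
result = np.zeros_like(dense)
result[indices] = slices
print(result)
```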
def update_slices(slices, indices, dense_tensor, head_dims):
"""Reconstitutes a tensor from slices and corresponding indices.
  Like stack_tensor, but instead of setting missing slices to 0, sets them to
what they were in the original tensor. The return value is reshaped to be
the same as dense_tensor.
Args:
slices: a tensor. Shape [K, D_1, ...]
indices: a 1-D integer tensor. Shape: [K]
dense_tensor: the original tensor the slices were taken
from. Shape: [D_0, D_1, ...]
head_dims: True dimensions of the dense_tensor's first dimension.
Returns:
    Reconstituted tensor. Shape: [D_0, D_1, ...]
"""
# NOTE(siege): This cast shouldn't be necessary.
indices = tf.cast(indices, tf.int32)
tail_dims = tf.shape(dense_tensor)[1:]
dense_shape = tf.concat([head_dims, tail_dims], 0)
update_mask_vals = tf.fill(tf.shape(indices), 1)
reshaped_indices = tf.expand_dims(indices, -1)
update_mask = tf.equal(
tf.scatter_nd(reshaped_indices, update_mask_vals, head_dims[:1]), 1)
reshaped_dense_slices = tf.reshape(
stack_tensor(slices, indices, dense_tensor, head_dims), dense_shape)
reshaped_dense_tensor = tf.reshape(dense_tensor, dense_shape)
return tf.reshape(
tf.where(update_mask, reshaped_dense_slices, reshaped_dense_tensor),
tf.shape(dense_tensor))
package(default_visibility = ["//visibility:public"])
# Libraries
# =========
py_library(
name = "datasets",
srcs = ["datasets.py"],
deps = [
],
)
py_library(
name = "model_adapter",
srcs = ["model_adapter.py"],
deps = [
":problem_generator",
],
)
py_library(
name = "problem_generator",
srcs = ["problem_generator.py"],
deps = [
":problem_spec",
],
)
py_library(
name = "problem_sets",
srcs = ["problem_sets.py"],
deps = [
":datasets",
":model_adapter",
":problem_generator",
],
)
py_library(
name = "problem_spec",
srcs = ["problem_spec.py"],
deps = [],
)
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Functions to generate or load datasets for supervised learning."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from collections import namedtuple
import numpy as np
from sklearn.datasets import make_classification
MAX_SEED = 4294967295
class Dataset(namedtuple("Dataset", "data labels")):
"""Helper class for managing a supervised learning dataset.
Args:
data: an array of type float32 with N samples, each of which is the set
of features for that sample. (Shape (N, D_i), where N is the number of
samples and D_i is the number of features for that sample.)
labels: an array of type int32 or int64 with N elements, indicating the
class label for the corresponding set of features in data.
"""
# Since this is an immutable object, we don't need to reserve slots.
__slots__ = ()
@property
def size(self):
"""Dataset size (number of samples)."""
return len(self.data)
def batch_indices(self, num_batches, batch_size):
"""Creates indices of shuffled minibatches.
Args:
num_batches: the number of batches to generate
batch_size: the size of each batch
Returns:
batch_indices: a list of minibatch indices, arranged so that the dataset
is randomly shuffled.
Raises:
ValueError: if the data and labels have different lengths
"""
if len(self.data) != len(self.labels):
raise ValueError("Labels and data must have the same number of samples.")
batch_indices = []
# Follows logic in mnist.py to ensure we cover the entire dataset.
index_in_epoch = 0
dataset_size = len(self.data)
dataset_indices = np.arange(dataset_size)
np.random.shuffle(dataset_indices)
for _ in range(num_batches):
start = index_in_epoch
index_in_epoch += batch_size
if index_in_epoch > dataset_size:
# Finished epoch, reshuffle.
np.random.shuffle(dataset_indices)
# Start next epoch.
start = 0
index_in_epoch = batch_size
end = index_in_epoch
batch_indices.append(dataset_indices[start:end].tolist())
return batch_indices
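The epoch-aware batching above, sketched with small illustrative numbers: when a batch would run past the end of the dataset, the indices are reshuffled and a new epoch begins, so every batch is full-size:

```python
import numpy as np

dataset_size, batch_size, num_batches = 5, 2, 4
index_in_epoch = 0
order = np.arange(dataset_size)
np.random.shuffle(order)

batches = []
for _ in range(num_batches):
    start = index_in_epoch
    index_in_epoch += batch_size
    if index_in_epoch > dataset_size:
        np.random.shuffle(order)       # finished an epoch; reshuffle
        start, index_in_epoch = 0, batch_size
    batches.append(order[start:index_in_epoch].tolist())

print([len(b) for b in batches])  # [2, 2, 2, 2]
```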
def noisy_parity_class(n_samples,
n_classes=2,
n_context_ids=5,
noise_prob=0.25,
random_seed=None):
"""Returns a randomly generated sparse-to-sparse dataset.
The label is a parity class of a set of context classes.
Args:
n_samples: number of samples (data points)
n_classes: number of class labels (default: 2)
n_context_ids: how many classes to take the parity of (default: 5).
noise_prob: how often to corrupt the label (default: 0.25)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
np.random.seed(random_seed)
x = np.random.randint(0, n_classes, [n_samples, n_context_ids])
noise = np.random.binomial(1, noise_prob, [n_samples])
y = (np.sum(x, 1) + noise) % n_classes
return Dataset(x.astype("float32"), y.astype("int32"))
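The parity labelling, shown without the noise term: the label is the sum of the context ids modulo the number of classes (hand-picked rows for illustration):

```python
import numpy as np

n_classes = 2
x = np.array([[0, 1, 1, 0, 1],
              [1, 1, 1, 1, 1]])
y = np.sum(x, 1) % n_classes   # parity of each row's context ids
print(y)  # [1 1]
```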
def random(n_features, n_samples, n_classes=2, sep=1.0, random_seed=None):
"""Returns a randomly generated classification dataset.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
n_classes: number of class labels (default: 2)
sep: separation of the two classes, a higher value corresponds to
an easier classification problem (default: 1.0)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
# Generate the problem data.
x, y = make_classification(n_samples=n_samples,
n_features=n_features,
n_informative=n_features,
n_redundant=0,
n_classes=n_classes,
class_sep=sep,
random_state=random_seed)
return Dataset(x.astype("float32"), y.astype("int32"))
def random_binary(n_features, n_samples, random_seed=None):
"""Returns a randomly generated dataset of binary values.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
random_seed = (np.random.randint(MAX_SEED) if random_seed is None
else random_seed)
np.random.seed(random_seed)
x = np.random.randint(2, size=(n_samples, n_features))
y = np.zeros((n_samples, 1))
return Dataset(x.astype("float32"), y.astype("int32"))
def random_symmetric(n_features, n_samples, random_seed=None):
"""Returns a randomly generated dataset of values and their negatives.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
random_seed = (np.random.randint(MAX_SEED) if random_seed is None
else random_seed)
np.random.seed(random_seed)
x1 = np.random.normal(size=(int(n_samples/2), n_features))
x = np.concatenate((x1, -x1), axis=0)
y = np.zeros((n_samples, 1))
return Dataset(x.astype("float32"), y.astype("int32"))
def random_mlp(n_features, n_samples, random_seed=None, n_layers=6, width=20):
"""Returns a generated output of an MLP with random weights.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
random_seed: seed used for drawing the random data (default: None)
n_layers: number of layers in random MLP
width: width of the layers in random MLP
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
random_seed = (np.random.randint(MAX_SEED) if random_seed is None
else random_seed)
np.random.seed(random_seed)
x = np.random.normal(size=(n_samples, n_features))
y = x
n_in = n_features
scale_factor = np.sqrt(2.) / np.sqrt(n_features)
for _ in range(n_layers):
weights = np.random.normal(size=(n_in, width)) * scale_factor
y = np.dot(y, weights).clip(min=0)
n_in = width
y = y[:, 0]
y[y > 0] = 1
return Dataset(x.astype("float32"), y.astype("int32"))
EMPTY_DATASET = Dataset(np.array([], dtype="float32"),
np.array([], dtype="int32"))
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of the ModelAdapter class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import mock
import tensorflow as tf
from learned_optimizer.problems import problem_generator as pg
class ModelAdapter(pg.Problem):
"""Adapts Tensorflow models/graphs into a form suitable for meta-training.
This class adapts an existing TensorFlow graph into a form suitable for
meta-training a learned optimizer.
"""
def __init__(self, make_loss_and_init_fn):
"""Wraps a model in the Problem interface.
    The make_loss_and_init_fn argument is a callable that returns a tuple of
    two other callables as follows.
    The first will construct most of the graph and return the problem loss. It
    is essential that this graph contains the totality of the model's variables,
    but none of its queues.
    The second will construct the model initialization graph given a list
    of parameters and return a callable that is passed an instance of
    tf.Session, and should initialize the model's parameters.
    A valid argument would look like this:
```python
def make_loss_and_init_fn():
inputs = queued_reader()
def make_loss():
return create_model_with_variables(inputs)
def make_init_fn(parameters):
saver = tf.Saver(parameters)
def init_fn(sess):
            saver.restore(sess, ...)
return init_fn
return make_loss, make_init_fn
```
Args:
      make_loss_and_init_fn: a callable, as described above
"""
make_loss_fn, make_init_fn = make_loss_and_init_fn()
self.make_loss_fn = make_loss_fn
self.parameters, self.constants = _get_variables(make_loss_fn)
if make_init_fn is not None:
init_fn = make_init_fn(self.parameters + self.constants)
else:
init_op = tf.initialize_variables(self.parameters + self.constants)
init_fn = lambda sess: sess.run(init_op)
tf.logging.info("ModelAdapter parameters: %s",
[op.name for op in self.parameters])
tf.logging.info("ModelAdapter constants: %s",
[op.name for op in self.constants])
super(ModelAdapter, self).__init__(
[], random_seed=None, noise_stdev=0.0, init_fn=init_fn)
def init_tensors(self, seed=None):
"""Returns a list of tensors with the given shape."""
return self.parameters
def init_variables(self, seed=None):
"""Returns a list of variables with the given shape."""
# NOTE(siege): This is awkward, as these are not set as trainable.
return self.parameters
def objective(self, parameters, data=None, labels=None):
"""Computes the objective given a list of parameters.
Args:
parameters: The parameters to optimize (as a list of tensors)
data: An optional batch of data for calculating objectives
labels: An optional batch of corresponding labels
Returns:
A scalar tensor representing the objective value
"""
# We need to set up a mapping based on the original parameter names, because
# the parameters passed can be arbitrary tensors.
parameter_mapping = {
old_p.name: p
for old_p, p in zip(self.parameters, parameters)
}
with tf.variable_scope(tf.get_variable_scope(), reuse=True):
return _make_with_custom_variables(self.make_loss_fn, parameter_mapping)
def _get_variables(func):
"""Calls func, returning any variables created.
The created variables are modified to not be trainable, and are placed into
the LOCAL_VARIABLES collection.
Args:
func: Function to be called.
Returns:
A tuple (variables, constants) where the first element is a list of
trainable variables and the second is the non-trainable variables.
"""
variables = []
constants = []
# We need to create these variables like normal, so grab the original
# constructor before we mock it.
original_init = tf.Variable.__init__
def custom_init(self, *args, **kwargs):
trainable = kwargs["trainable"]
kwargs["trainable"] = False
# Making these variables local keeps them out of the optimizer's checkpoints
# somehow.
kwargs["collections"] = [tf.GraphKeys.LOCAL_VARIABLES]
original_init(self, *args, **kwargs)
if trainable:
variables.append(self)
else:
constants.append(self)
# This name-scope is just a nicety for TensorBoard.
with tf.name_scope("unused_graph"):
with mock.patch.object(tf.Variable, "__init__", custom_init):
func()
return variables, constants
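The variable-capture trick used by `_get_variables` can be sketched without TensorFlow: patch a class's `__init__` to record every instance created inside a function call (the `Var` class here is a stand-in for `tf.Variable`):

```python
from unittest import mock

class Var:
    def __init__(self, value, trainable=True):
        self.value, self.trainable = value, trainable

captured = []
original_init = Var.__init__   # grab the real constructor before patching

def recording_init(self, *args, **kwargs):
    original_init(self, *args, **kwargs)
    captured.append(self)

def build():
    Var(1.0)
    Var(2.0, trainable=False)

# While patched, every Var created inside build() is recorded.
with mock.patch.object(Var, "__init__", recording_init):
    build()

print(len(captured), [v.trainable for v in captured])  # 2 [True, False]
```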
def _make_with_custom_variables(func, variable_mapping):
"""Calls func and replaces the value of some variables created in it.
Args:
func: Function to be called.
variable_mapping: A mapping of variable name to the replacement tensor or
tf.Variable.
Returns:
The return value of func is returned.
"""
original_value = tf.Variable.value
def custom_value(self):
if self.name in variable_mapping:
replacement = variable_mapping[self.name]
tf.logging.info("Replaced %s with %s" % (self.name, replacement))
      # The value() method needs to return a tensor, so we call the original
      # value() on the replacement manually; going through the patched method
      # would cause an infinite loop.
if isinstance(replacement, tf.Variable):
replacement = original_value(replacement)
return replacement
else:
return original_value(self)
with mock.patch.object(tf.Variable, "value", custom_value):
with mock.patch.object(tf.Variable, "_AsTensor", custom_value):
return func()