Commit f2120b07 authored by Olga Wichrowska

Added code for Learned Optimizers that Scale and Generalize

parent 6024579b
@@ -10,6 +10,7 @@ differential_privacy/* @panyx0718
domain_adaptation/* @bousmalis @ddohan
im2txt/* @cshallue
inception/* @shlens @vincentvanhoucke
learned_optimizer/* @olganw @nirum
learning_to_remember_rare_events/* @lukaszkaiser @ofirnachum
lfads/* @jazcollins @susillo
lm_1b/* @oriolvinyals @panyx0718
# Learning to Optimize Learning (LOL)
package(default_visibility = ["//visibility:public"])
# Libraries
# =========
py_library(
name = "metaopt",
srcs = ["metaopt.py"],
deps = [
"//learned_optimizer/problems:datasets",
"//learned_optimizer/problems:problem_generator",
],
)
# Binaries
# ========
py_binary(
name = "metarun",
srcs = ["metarun.py"],
deps = [
":metaopt",
"//learned_optimizer/optimizer:coordinatewise_rnn",
"//learned_optimizer/optimizer:global_learning_rate",
"//learned_optimizer/optimizer:hierarchical_rnn",
"//learned_optimizer/optimizer:learning_rate_schedule",
"//learned_optimizer/optimizer:trainable_adam",
"//learned_optimizer/problems:problem_sets",
"//learned_optimizer/problems:problem_spec",
],
)
# Learned Optimizer
Code for [Learned Optimizers that Scale and Generalize](https://arxiv.org/abs/1703.04813).
## Requirements
* Bazel ([install](https://bazel.build/versions/master/docs/install.html))
* TensorFlow >= v1.3
## Training a Learned Optimizer
## Code Overview
In the top-level directory, ```metaopt.py``` contains the code to train and test a learned optimizer. ```metarun.py``` packages the actual training procedure into a
single file, defining and exposing many flags to tune the procedure, from selecting the optimizer type and problem set down to fine-grained hyperparameter settings.
There is no testing binary; testing can be done ad hoc via ```metaopt.test_optimizer``` by passing in an optimizer object and a directory containing a checkpoint.
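The ad hoc testing flow can be sketched outside TensorFlow (all names below are illustrative, not the repo's API): evaluate an optimizer object on a fixed problem and check that the objective decreases.

```python
import numpy as np

class SGD(object):
    """Stand-in for a restored optimizer object (illustrative only)."""
    def __init__(self, lr):
        self.lr = lr

    def step(self, param, grad):
        return param - self.lr * grad

def evaluate_optimizer(opt, num_iters=100):
    # Quadratic test problem f(x) = 0.5 * ||x||^2, whose gradient is x.
    x = np.ones(5)
    losses = []
    for _ in range(num_iters):
        losses.append(0.5 * float(np.dot(x, x)))
        x = opt.step(x, x)
    return losses

losses = evaluate_optimizer(SGD(lr=0.1))
```

A real run would instead pass a restored ```HierarchicalRNN``` instance and a checkpoint directory to ```metaopt.test_optimizer```.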
The ```optimizer``` directory contains a base ```trainable_optimizer.py``` class and a number of extensions, including the ```hierarchical_rnn``` optimizer used in
the paper, a ```coordinatewise_rnn``` optimizer that more closely matches previous work, and a number of simpler optimizers to demonstrate the basic mechanics of
a learnable optimizer.
The ```problems``` directory contains the code to build the problems that were used in the meta-training set.
### Binaries
```metarun.py```: meta-training of a learned optimizer
### Command-Line Flags
The flags most relevant to meta-training are defined in ```metarun.py```. The default values will meta-train a HierarchicalRNN optimizer with the hyperparameter
settings used in the paper.
### Using a Learned Optimizer as a Black Box
The ```trainable_optimizer``` inherits from ```tf.train.Optimizer```, so a properly instantiated version can be used to train any model with any API that accepts
this class. There are two caveats:
1. If using the HierarchicalRNN optimizer, the ```apply_gradients``` return type must be changed (see the inline comments for exactly what must be removed).
2. Care must be taken to restore the optimizer's variables without overwriting them. Optimizer variables should be loaded manually using a pretrained checkpoint
and a ```tf.train.Saver``` with only the optimizer variables. Then, when constructing the session, ensure that any automatic variable initialization does not
re-initialize the loaded optimizer variables.
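The second caveat amounts to a name-filtered restore. A minimal stand-in for the idea, using plain dicts in place of checkpoints and ```tf.train.Saver``` (the variable names are hypothetical):

```python
# Pretend checkpoint and freshly constructed variables, keyed by name.
checkpoint = {"optimizer/rnn/update_weights": 0.7, "model/dense/kernel": 2.0}
variables = {"optimizer/rnn/update_weights": 0.0, "model/dense/kernel": 0.0}

# Restore only optimizer-scoped variables (what a Saver built over just the
# optimizer's variable list would do); leave model variables untouched.
restored = {
    name: (checkpoint[name] if name.startswith("optimizer/") else value)
    for name, value in variables.items()
}

# Any later blanket initialization must skip the names already restored.
already_restored = {n for n in restored if n.startswith("optimizer/")}
```

The session-construction step then initializes only variables outside `already_restored`.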
## Contact for Issues
* Olga Wichrowska (@olganw), Niru Maheswaranathan (@nirum)
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Scripts for meta-optimization."""
from __future__ import print_function
import os
import tensorflow as tf
import metaopt
from learned_optimizer.optimizer import coordinatewise_rnn
from learned_optimizer.optimizer import global_learning_rate
from learned_optimizer.optimizer import hierarchical_rnn
from learned_optimizer.optimizer import learning_rate_schedule
from learned_optimizer.optimizer import trainable_adam
from learned_optimizer.problems import problem_sets as ps
from learned_optimizer.problems import problem_spec
tf.app.flags.DEFINE_string("train_dir", "/tmp/lol/",
"""Directory to store parameters and results.""")
tf.app.flags.DEFINE_integer("task", 0,
"""Task id of the replica running the training.""")
tf.app.flags.DEFINE_integer("worker_tasks", 1,
"""Number of tasks in the worker job.""")
tf.app.flags.DEFINE_integer("num_problems", 1000,
"""Number of sub-problems to run.""")
tf.app.flags.DEFINE_integer("num_meta_iterations", 5,
"""Number of meta-iterations to optimize.""")
tf.app.flags.DEFINE_integer("num_unroll_scale", 40,
"""The scale parameter of the exponential
distribution from which the number of partial
unrolls is drawn.""")
tf.app.flags.DEFINE_integer("min_num_unrolls", 1,
"""The minimum number of unrolls per problem.""")
tf.app.flags.DEFINE_integer("num_partial_unroll_itr_scale", 200,
"""The scale parameter of the exponential
distribution from which the number of iterations
per unroll is drawn.""")
tf.app.flags.DEFINE_integer("min_num_itr_partial_unroll", 50,
"""The minimum number of iterations for one
unroll.""")
tf.app.flags.DEFINE_string("optimizer", "HierarchicalRNN",
"""Which meta-optimizer to train.""")
# CoordinatewiseRNN-specific flags
tf.app.flags.DEFINE_integer("cell_size", 20,
"""Size of the RNN hidden state in each layer.""")
tf.app.flags.DEFINE_integer("num_cells", 2,
"""Number of RNN layers.""")
tf.app.flags.DEFINE_string("cell_cls", "GRUCell",
"""Type of RNN cell to use.""")
# Metaoptimization parameters
tf.app.flags.DEFINE_float("meta_learning_rate", 1e-6,
"""The learning rate for the meta-optimizer.""")
tf.app.flags.DEFINE_float("gradient_clip_level", 1e4,
"""The level to clip gradients to.""")
# Training set selection
tf.app.flags.DEFINE_boolean("include_quadratic_problems", False,
"""Include non-noisy quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_quadratic_problems", True,
"""Include noisy quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_large_quadratic_problems", True,
"""Include very large quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_bowl_problems", True,
"""Include 2D bowl problems.""")
tf.app.flags.DEFINE_boolean("include_softmax_2_class_problems", True,
"""Include 2-class logistic regression problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_softmax_2_class_problems", True,
"""Include noisy 2-class logistic regression
problems.""")
tf.app.flags.DEFINE_boolean("include_optimization_test_problems", True,
"""Include non-noisy versions of classic
optimization test problems, e.g. Rosenbrock.""")
tf.app.flags.DEFINE_boolean("include_noisy_optimization_test_problems", True,
"""Include gradient-noise versions of classic
optimization test problems, e.g. Rosenbrock.""")
tf.app.flags.DEFINE_boolean("include_fully_connected_random_2_class_problems",
True, """Include MLP problems for 2 classes.""")
tf.app.flags.DEFINE_boolean("include_matmul_problems", True,
"""Include matrix multiplication problems.""")
tf.app.flags.DEFINE_boolean("include_log_objective_problems", True,
"""Include problems where the objective is the log
objective of another problem, e.g. Bowl.""")
tf.app.flags.DEFINE_boolean("include_rescale_problems", True,
"""Include problems where the parameters are scaled
version of the original parameters.""")
tf.app.flags.DEFINE_boolean("include_norm_problems", True,
"""Include problems where the objective is the
N-norm of another problem, e.g. Quadratic.""")
tf.app.flags.DEFINE_boolean("include_sum_problems", True,
"""Include problems where the objective is the sum
of the objectives of the subproblems that make
up the problem parameters. Per-problem tensors
are still independent of each other.""")
tf.app.flags.DEFINE_boolean("include_sparse_gradient_problems", True,
"""Include problems where the gradient is set to 0
with some high probability.""")
tf.app.flags.DEFINE_boolean("include_sparse_softmax_problems", False,
"""Include sparse softmax problems.""")
tf.app.flags.DEFINE_boolean("include_one_hot_sparse_softmax_problems", False,
"""Include one-hot sparse softmax problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_bowl_problems", True,
"""Include noisy bowl problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_norm_problems", True,
"""Include noisy norm problems.""")
tf.app.flags.DEFINE_boolean("include_noisy_sum_problems", True,
"""Include noisy sum problems.""")
tf.app.flags.DEFINE_boolean("include_sum_of_quadratics_problems", False,
"""Include sum of quadratics problems.""")
tf.app.flags.DEFINE_boolean("include_projection_quadratic_problems", False,
"""Include projection quadratic problems.""")
tf.app.flags.DEFINE_boolean("include_outward_snake_problems", False,
"""Include outward snake problems.""")
tf.app.flags.DEFINE_boolean("include_dependency_chain_problems", False,
"""Include dependency chain problems.""")
tf.app.flags.DEFINE_boolean("include_min_max_well_problems", False,
"""Include min-max well problems.""")
# Optimizer parameters: initialization and scale values
tf.app.flags.DEFINE_float("min_lr", 1e-6,
"""The minimum initial learning rate.""")
tf.app.flags.DEFINE_float("max_lr", 1e-2,
"""The maximum initial learning rate.""")
# Optimizer parameters: small features.
tf.app.flags.DEFINE_boolean("zero_init_lr_weights", True,
"""Whether to initialize the learning rate weights
to 0 rather than the scaled random initialization
used for other RNN variables.""")
tf.app.flags.DEFINE_boolean("use_relative_lr", True,
"""Whether to use the relative learning rate as an
input during training. Can only be used if
learnable_decay is also True.""")
tf.app.flags.DEFINE_boolean("use_extreme_indicator", False,
"""Whether to use the extreme indicator for learning
rates as an input during training. Can only be
used if learnable_decay is also True.""")
tf.app.flags.DEFINE_boolean("use_log_means_squared", True,
"""Whether to track the log of the mean squared
grads instead of the means squared grads.""")
tf.app.flags.DEFINE_boolean("use_problem_lr_mean", True,
"""Whether to use the mean over all learning rates
in the problem when calculating the relative
learning rate.""")
# Optimizer parameters: major features
tf.app.flags.DEFINE_boolean("learnable_decay", True,
"""Whether to learn weights that dynamically
modulate the input scale via RMS decay.""")
tf.app.flags.DEFINE_boolean("dynamic_output_scale", True,
"""Whether to learn weights that dynamically
modulate the output scale.""")
tf.app.flags.DEFINE_boolean("use_log_objective", True,
"""Whether to use the log of the scaled objective
rather than just the scaled obj for training.""")
tf.app.flags.DEFINE_boolean("use_attention", False,
"""Whether to learn where to attend.""")
tf.app.flags.DEFINE_boolean("use_second_derivatives", True,
"""Whether to use second derivatives.""")
tf.app.flags.DEFINE_integer("num_gradient_scales", 4,
"""How many different timescales to keep for
gradient history. If > 1, also learns a scale
factor for gradient history.""")
tf.app.flags.DEFINE_float("max_log_lr", 33,
"""The maximum log learning rate allowed.""")
tf.app.flags.DEFINE_float("objective_training_max_multiplier", -1,
"""How much the objective can grow before training on
this problem / param pair is terminated. Sets a max
on the objective value when multiplied by the
initial objective. If <= 0, not used.""")
tf.app.flags.DEFINE_boolean("use_gradient_shortcut", True,
"""Whether to add a learned affine projection of the
gradient to the update delta in addition to the
gradient function computed by the RNN.""")
tf.app.flags.DEFINE_boolean("use_lr_shortcut", False,
"""Whether to add the difference between the current
learning rate and the desired learning rate to
the RNN input.""")
tf.app.flags.DEFINE_boolean("use_grad_products", True,
"""Whether to use gradient products in the input to
the RNN. Only applicable when num_gradient_scales
> 1.""")
tf.app.flags.DEFINE_boolean("use_multiple_scale_decays", False,
"""Whether to use many-timescale scale decays.""")
tf.app.flags.DEFINE_boolean("use_numerator_epsilon", False,
"""Whether to use epsilon in the numerator of the
log objective.""")
tf.app.flags.DEFINE_boolean("learnable_inp_decay", True,
"""Whether to learn input decay weight and bias.""")
tf.app.flags.DEFINE_boolean("learnable_rnn_init", True,
"""Whether to learn RNN state initialization.""")
FLAGS = tf.app.flags.FLAGS
# The size of the RNN hidden state in each layer:
# [PerParam, PerTensor, Global]. The length of this list must be 1, 2, or 3.
# If less than 3, the Global and/or PerTensor RNNs will not be created.
HRNN_CELL_SIZES = [10, 20, 20]
def register_optimizers():
opts = {}
opts["CoordinatewiseRNN"] = coordinatewise_rnn.CoordinatewiseRNN
opts["GlobalLearningRate"] = global_learning_rate.GlobalLearningRate
opts["HierarchicalRNN"] = hierarchical_rnn.HierarchicalRNN
opts["LearningRateSchedule"] = learning_rate_schedule.LearningRateSchedule
opts["TrainableAdam"] = trainable_adam.TrainableAdam
return opts
def main(unused_argv):
"""Runs the main script."""
opts = register_optimizers()
# Choose a set of problems to optimize. By default this includes quadratics,
# 2-dimensional bowls, 2-class softmax problems, and non-noisy optimization
# test problems (e.g. Rosenbrock, Beale)
problems_and_data = []
if FLAGS.include_sparse_softmax_problems:
problems_and_data.extend(ps.sparse_softmax_2_class_sparse_problems())
if FLAGS.include_one_hot_sparse_softmax_problems:
problems_and_data.extend(
ps.one_hot_sparse_softmax_2_class_sparse_problems())
if FLAGS.include_quadratic_problems:
problems_and_data.extend(ps.quadratic_problems())
if FLAGS.include_noisy_quadratic_problems:
problems_and_data.extend(ps.quadratic_problems_noisy())
if FLAGS.include_large_quadratic_problems:
problems_and_data.extend(ps.quadratic_problems_large())
if FLAGS.include_bowl_problems:
problems_and_data.extend(ps.bowl_problems())
if FLAGS.include_noisy_bowl_problems:
problems_and_data.extend(ps.bowl_problems_noisy())
if FLAGS.include_softmax_2_class_problems:
problems_and_data.extend(ps.softmax_2_class_problems())
if FLAGS.include_noisy_softmax_2_class_problems:
problems_and_data.extend(ps.softmax_2_class_problems_noisy())
if FLAGS.include_optimization_test_problems:
problems_and_data.extend(ps.optimization_test_problems())
if FLAGS.include_noisy_optimization_test_problems:
problems_and_data.extend(ps.optimization_test_problems_noisy())
if FLAGS.include_fully_connected_random_2_class_problems:
problems_and_data.extend(ps.fully_connected_random_2_class_problems())
if FLAGS.include_matmul_problems:
problems_and_data.extend(ps.matmul_problems())
if FLAGS.include_log_objective_problems:
problems_and_data.extend(ps.log_objective_problems())
if FLAGS.include_rescale_problems:
problems_and_data.extend(ps.rescale_problems())
if FLAGS.include_norm_problems:
problems_and_data.extend(ps.norm_problems())
if FLAGS.include_noisy_norm_problems:
problems_and_data.extend(ps.norm_problems_noisy())
if FLAGS.include_sum_problems:
problems_and_data.extend(ps.sum_problems())
if FLAGS.include_noisy_sum_problems:
problems_and_data.extend(ps.sum_problems_noisy())
  if FLAGS.include_sparse_gradient_problems:
    problems_and_data.extend(ps.sparse_gradient_problems())
    if FLAGS.include_fully_connected_random_2_class_problems:
      problems_and_data.extend(ps.sparse_gradient_problems_mlp())
if FLAGS.include_min_max_well_problems:
problems_and_data.extend(ps.min_max_well_problems())
if FLAGS.include_sum_of_quadratics_problems:
problems_and_data.extend(ps.sum_of_quadratics_problems())
if FLAGS.include_projection_quadratic_problems:
problems_and_data.extend(ps.projection_quadratic_problems())
if FLAGS.include_outward_snake_problems:
problems_and_data.extend(ps.outward_snake_problems())
if FLAGS.include_dependency_chain_problems:
problems_and_data.extend(ps.dependency_chain_problems())
# log directory
logdir = os.path.join(FLAGS.train_dir,
"{}_{}_{}_{}".format(FLAGS.optimizer,
FLAGS.cell_cls,
FLAGS.cell_size,
FLAGS.num_cells))
# get the optimizer class and arguments
optimizer_cls = opts[FLAGS.optimizer]
assert len(HRNN_CELL_SIZES) in [1, 2, 3]
optimizer_args = (HRNN_CELL_SIZES,)
optimizer_kwargs = {
"init_lr_range": (FLAGS.min_lr, FLAGS.max_lr),
"learnable_decay": FLAGS.learnable_decay,
"dynamic_output_scale": FLAGS.dynamic_output_scale,
"cell_cls": getattr(tf.contrib.rnn, FLAGS.cell_cls),
"use_attention": FLAGS.use_attention,
"use_log_objective": FLAGS.use_log_objective,
"num_gradient_scales": FLAGS.num_gradient_scales,
"zero_init_lr_weights": FLAGS.zero_init_lr_weights,
"use_log_means_squared": FLAGS.use_log_means_squared,
"use_relative_lr": FLAGS.use_relative_lr,
"use_extreme_indicator": FLAGS.use_extreme_indicator,
"max_log_lr": FLAGS.max_log_lr,
"obj_train_max_multiplier": FLAGS.objective_training_max_multiplier,
"use_problem_lr_mean": FLAGS.use_problem_lr_mean,
"use_gradient_shortcut": FLAGS.use_gradient_shortcut,
"use_second_derivatives": FLAGS.use_second_derivatives,
"use_lr_shortcut": FLAGS.use_lr_shortcut,
"use_grad_products": FLAGS.use_grad_products,
"use_multiple_scale_decays": FLAGS.use_multiple_scale_decays,
"use_numerator_epsilon": FLAGS.use_numerator_epsilon,
"learnable_inp_decay": FLAGS.learnable_inp_decay,
"learnable_rnn_init": FLAGS.learnable_rnn_init,
}
optimizer_spec = problem_spec.Spec(
optimizer_cls, optimizer_args, optimizer_kwargs)
# make log directory
tf.gfile.MakeDirs(logdir)
is_chief = FLAGS.task == 0
# if this is a distributed run, make the chief run through problems in order
select_random_problems = FLAGS.worker_tasks == 1 or not is_chief
def num_unrolls():
return metaopt.sample_numiter(FLAGS.num_unroll_scale, FLAGS.min_num_unrolls)
def num_partial_unroll_itrs():
return metaopt.sample_numiter(FLAGS.num_partial_unroll_itr_scale,
FLAGS.min_num_itr_partial_unroll)
# run it
metaopt.train_optimizer(
logdir,
optimizer_spec,
problems_and_data,
FLAGS.num_problems,
FLAGS.num_meta_iterations,
num_unrolls,
num_partial_unroll_itrs,
learning_rate=FLAGS.meta_learning_rate,
gradient_clip=FLAGS.gradient_clip_level,
is_chief=is_chief,
select_random_problems=select_random_problems,
obj_train_max_multiplier=FLAGS.objective_training_max_multiplier,
callbacks=[])
return 0
if __name__ == "__main__":
tf.app.run()
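The two sampling helpers in ```main``` call ```metaopt.sample_numiter``` with a scale and a minimum. A plausible reading of that behavior (assumed here, not taken from ```metaopt.py```) is an exponential draw with an enforced floor:

```python
import numpy as np

def sample_numiter(scale, min_count):
    # Assumed behavior: draw a count from an exponential distribution with
    # the given scale, never returning fewer than min_count iterations.
    return max(int(np.random.exponential(scale)), min_count)

# Mirrors the defaults: num_unroll_scale=40, min_num_unrolls=1.
draws = [sample_numiter(40, 1) for _ in range(1000)]
```

Randomizing the unroll counts this way exposes the meta-optimizer to many effective training horizons instead of a single fixed one.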
package(default_visibility = ["//visibility:public"])
# Libraries
# =========
py_library(
name = "coordinatewise_rnn",
srcs = ["coordinatewise_rnn.py"],
deps = [
":trainable_optimizer",
":utils",
],
)
py_library(
name = "global_learning_rate",
srcs = ["global_learning_rate.py"],
deps = [
":trainable_optimizer",
],
)
py_library(
name = "hierarchical_rnn",
srcs = ["hierarchical_rnn.py"],
deps = [
":rnn_cells",
":trainable_optimizer",
":utils",
],
)
py_library(
name = "learning_rate_schedule",
srcs = ["learning_rate_schedule.py"],
deps = [
":trainable_optimizer",
],
)
py_library(
name = "rnn_cells",
srcs = ["rnn_cells.py"],
deps = [
":utils",
],
)
py_library(
name = "trainable_adam",
srcs = ["trainable_adam.py"],
deps = [
":trainable_optimizer",
":utils",
],
)
py_library(
name = "trainable_optimizer",
srcs = ["trainable_optimizer.py"],
deps = [
],
)
py_library(
name = "utils",
srcs = ["utils.py"],
deps = [
],
)
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Collection of trainable optimizers for meta-optimization."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import numpy as np
import tensorflow as tf
from learned_optimizer.optimizer import utils
from learned_optimizer.optimizer import trainable_optimizer as opt
# Default was 1e-3
tf.app.flags.DEFINE_float("crnn_rnn_readout_scale", 0.5,
"""The initialization scale for the RNN readouts.""")
tf.app.flags.DEFINE_float("crnn_default_decay_var_init", 2.2,
"""The default initializer value for any decay/
momentum style variables and constants.
sigmoid(2.2) ~ 0.9, sigmoid(-2.2) ~ 0.1.""")
FLAGS = tf.app.flags.FLAGS
class CoordinatewiseRNN(opt.TrainableOptimizer):
"""RNN that operates on each coordinate of the problem independently."""
def __init__(self,
cell_sizes,
cell_cls,
init_lr_range=(1., 1.),
dynamic_output_scale=True,
learnable_decay=True,
zero_init_lr_weights=False,
**kwargs):
"""Initializes the RNN per-parameter optimizer.
Args:
cell_sizes: List of hidden state sizes for each RNN cell in the network
cell_cls: tf.contrib.rnn class for specifying the RNN cell type
init_lr_range: the range in which to initialize the learning rates.
dynamic_output_scale: whether to learn weights that dynamically modulate
the output scale (default: True)
learnable_decay: whether to learn weights that dynamically modulate the
input scale via RMS style decay (default: True)
zero_init_lr_weights: whether to initialize the lr weights to zero
**kwargs: args passed to TrainableOptimizer's constructor
Raises:
ValueError: If the init lr range is not of length 2.
ValueError: If the init lr range is not a valid range (min > max).
"""
if len(init_lr_range) != 2:
raise ValueError(
"Initial LR range must be len 2, was {}".format(len(init_lr_range)))
if init_lr_range[0] > init_lr_range[1]:
raise ValueError("Initial LR range min is greater than max.")
self.init_lr_range = init_lr_range
self.zero_init_lr_weights = zero_init_lr_weights
self.reuse_vars = False
# create the RNN cell
with tf.variable_scope(opt.OPTIMIZER_SCOPE):
self.component_cells = [cell_cls(sz) for sz in cell_sizes]
self.cell = tf.contrib.rnn.MultiRNNCell(self.component_cells)
# random normal initialization scaled by the output size
scale_factor = FLAGS.crnn_rnn_readout_scale / math.sqrt(cell_sizes[-1])
scaled_init = tf.random_normal_initializer(0., scale_factor)
# weights for projecting the hidden state to a parameter update
self.update_weights = tf.get_variable("update_weights",
shape=(cell_sizes[-1], 1),
initializer=scaled_init)
self._initialize_decay(learnable_decay, (cell_sizes[-1], 1), scaled_init)
self._initialize_lr(dynamic_output_scale, (cell_sizes[-1], 1),
scaled_init)
state_size = sum([sum(state_size) for state_size in self.cell.state_size])
self._init_vector = tf.get_variable(
"init_vector", shape=[1, state_size],
initializer=tf.random_uniform_initializer(-1., 1.))
state_keys = ["rms", "rnn", "learning_rate", "decay"]
super(CoordinatewiseRNN, self).__init__("cRNN", state_keys, **kwargs)
def _initialize_decay(
self, learnable_decay, weights_tensor_shape, scaled_init):
"""Initializes the decay weights and bias variables or tensors.
Args:
learnable_decay: Whether to use learnable decay.
weights_tensor_shape: The shape the weight tensor should take.
scaled_init: The scaled initialization for the weights tensor.
"""
if learnable_decay:
# weights for projecting the hidden state to the RMS decay term
self.decay_weights = tf.get_variable("decay_weights",
shape=weights_tensor_shape,
initializer=scaled_init)
self.decay_bias = tf.get_variable(
"decay_bias", shape=(1,),
initializer=tf.constant_initializer(
FLAGS.crnn_default_decay_var_init))
else:
self.decay_weights = tf.zeros_like(self.update_weights)
self.decay_bias = tf.constant(FLAGS.crnn_default_decay_var_init)
def _initialize_lr(
self, dynamic_output_scale, weights_tensor_shape, scaled_init):
"""Initializes the learning rate weights and bias variables or tensors.
Args:
dynamic_output_scale: Whether to use a dynamic output scale.
weights_tensor_shape: The shape the weight tensor should take.
scaled_init: The scaled initialization for the weights tensor.
"""
if dynamic_output_scale:
zero_init = tf.constant_initializer(0.)
wt_init = zero_init if self.zero_init_lr_weights else scaled_init
self.lr_weights = tf.get_variable("learning_rate_weights",
shape=weights_tensor_shape,
initializer=wt_init)
self.lr_bias = tf.get_variable("learning_rate_bias", shape=(1,),
initializer=zero_init)
else:
self.lr_weights = tf.zeros_like(self.update_weights)
self.lr_bias = tf.zeros([1, 1])
def _initialize_state(self, var):
"""Return a dictionary mapping names of state variables to their values."""
vectorized_shape = [var.get_shape().num_elements(), 1]
min_lr = self.init_lr_range[0]
max_lr = self.init_lr_range[1]
if min_lr == max_lr:
init_lr = tf.constant(min_lr, shape=vectorized_shape)
else:
actual_vals = tf.random_uniform(vectorized_shape,
np.log(min_lr),
np.log(max_lr))
init_lr = tf.exp(actual_vals)
ones = tf.ones(vectorized_shape)
rnn_init = ones * self._init_vector
return {
"rms": tf.ones(vectorized_shape),
"learning_rate": init_lr,
"rnn": rnn_init,
"decay": tf.ones(vectorized_shape),
}
def _compute_update(self, param, grad, state):
"""Update parameters given the gradient and state.
Args:
param: tensor of parameters
grad: tensor of gradients with the same shape as param
state: a dictionary containing any state for the optimizer
Returns:
updated_param: updated parameters
updated_state: updated state variables in a dictionary
"""
with tf.variable_scope(opt.OPTIMIZER_SCOPE) as scope:
if self.reuse_vars:
scope.reuse_variables()
else:
self.reuse_vars = True
param_shape = tf.shape(param)
(grad_values, decay_state, rms_state, rnn_state, learning_rate_state,
grad_indices) = self._extract_gradients_and_internal_state(
grad, state, param_shape)
# Vectorize and scale the gradients.
grad_scaled, rms = utils.rms_scaling(grad_values, decay_state, rms_state)
# Apply the RNN update.
rnn_state_tuples = self._unpack_rnn_state_into_tuples(rnn_state)
rnn_output, rnn_state_tuples = self.cell(grad_scaled, rnn_state_tuples)
rnn_state = self._pack_tuples_into_rnn_state(rnn_state_tuples)
# Compute the update direction (a linear projection of the RNN output).
delta = utils.project(rnn_output, self.update_weights)
# The updated decay is an affine projection of the hidden state
decay = utils.project(rnn_output, self.decay_weights,
bias=self.decay_bias, activation=tf.nn.sigmoid)
# Compute the change in learning rate (an affine projection of the RNN
# state, passed through a 2x sigmoid, so the change is bounded).
learning_rate_change = 2. * utils.project(rnn_output, self.lr_weights,
bias=self.lr_bias,
activation=tf.nn.sigmoid)
# Update the learning rate.
new_learning_rate = learning_rate_change * learning_rate_state
# Apply the update to the parameters.
update = tf.reshape(new_learning_rate * delta, tf.shape(grad_values))
if isinstance(grad, tf.IndexedSlices):
update = utils.stack_tensor(update, grad_indices, param,
param_shape[:1])
rms = utils.update_slices(rms, grad_indices, state["rms"], param_shape)
new_learning_rate = utils.update_slices(new_learning_rate, grad_indices,
state["learning_rate"],
param_shape)
rnn_state = utils.update_slices(rnn_state, grad_indices, state["rnn"],
param_shape)
decay = utils.update_slices(decay, grad_indices, state["decay"],
param_shape)
new_param = param - update
# Collect the update and new state.
new_state = {
"rms": rms,
"learning_rate": new_learning_rate,
"rnn": rnn_state,
"decay": decay,
}
return new_param, new_state
def _extract_gradients_and_internal_state(self, grad, state, param_shape):
"""Extracts the gradients and relevant internal state.
If the gradient is sparse, extracts the appropriate slices from the state.
Args:
grad: The current gradient.
state: The current state.
param_shape: The shape of the parameter (used if gradient is sparse).
Returns:
grad_values: The gradient value tensor.
decay_state: The current decay state.
rms_state: The current rms state.
rnn_state: The current state of the internal rnns.
learning_rate_state: The current learning rate state.
grad_indices: The indices for the gradient tensor, if sparse.
None otherwise.
"""
if isinstance(grad, tf.IndexedSlices):
grad_indices, grad_values = utils.accumulate_sparse_gradients(grad)
decay_state = utils.slice_tensor(state["decay"], grad_indices,
param_shape)
rms_state = utils.slice_tensor(state["rms"], grad_indices, param_shape)
rnn_state = utils.slice_tensor(state["rnn"], grad_indices, param_shape)
learning_rate_state = utils.slice_tensor(state["learning_rate"],
grad_indices, param_shape)
decay_state.set_shape([None, 1])
rms_state.set_shape([None, 1])
else:
grad_values = grad
grad_indices = None
decay_state = state["decay"]
rms_state = state["rms"]
rnn_state = state["rnn"]
learning_rate_state = state["learning_rate"]
return (grad_values, decay_state, rms_state, rnn_state, learning_rate_state,
grad_indices)
def _unpack_rnn_state_into_tuples(self, rnn_state):
"""Creates state tuples from the rnn state vector."""
rnn_state_tuples = []
cur_state_pos = 0
for cell in self.component_cells:
total_state_size = sum(cell.state_size)
cur_state = tf.slice(rnn_state, [0, cur_state_pos],
[-1, total_state_size])
cur_state_tuple = tf.split(value=cur_state, num_or_size_splits=2,
axis=1)
rnn_state_tuples.append(cur_state_tuple)
cur_state_pos += total_state_size
return rnn_state_tuples
def _pack_tuples_into_rnn_state(self, rnn_state_tuples):
"""Creates a single state vector concatenated along column axis."""
rnn_state = None
for new_state_tuple in rnn_state_tuples:
new_c, new_h = new_state_tuple
if rnn_state is None:
rnn_state = tf.concat([new_c, new_h], axis=1)
else:
rnn_state = tf.concat([rnn_state, tf.concat([new_c, new_h], 1)], axis=1)
return rnn_state
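The learning-rate mechanics in ```_compute_update``` are easy to check in isolation: the change factor is ```2 * sigmoid(projection)```, so each step multiplies the per-coordinate learning rate by a factor bounded in (0, 2), and a zero projection leaves the rate unchanged. A standalone NumPy sketch of just that piece:

```python
import numpy as np

def lr_change(projection):
    # 2 * sigmoid(projection): bounded in (0, 2), equal to 1 at zero.
    return 2.0 / (1.0 + np.exp(-projection))

def update_lr(lr, projection):
    # Multiplicative update, as in: new_learning_rate =
    # learning_rate_change * learning_rate_state.
    return lr_change(projection) * lr
```

Because the factor is multiplicative and bounded, the learned controller can at most double or (asymptotically) zero out a learning rate per step, which keeps the adaptation stable.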
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A trainable optimizer that learns a single global learning rate."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from learned_optimizer.optimizer import trainable_optimizer
class GlobalLearningRate(trainable_optimizer.TrainableOptimizer):
"""Optimizes for a single global learning rate."""
def __init__(self, initial_rate=1e-3, **kwargs):
"""Initializes the global learning rate."""
with tf.variable_scope(trainable_optimizer.OPTIMIZER_SCOPE):
initializer = tf.constant_initializer(initial_rate)
self.learning_rate = tf.get_variable("global_learning_rate", shape=(),
initializer=initializer)
super(GlobalLearningRate, self).__init__("GLR", [], **kwargs)
def _compute_update(self, param, grad, state):
return param - tf.scalar_mul(self.learning_rate, grad), state
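Stripped of the TensorFlow plumbing, ```GlobalLearningRate``` is plain SGD whose scalar rate is itself trained by the meta-optimizer; the per-step update is just:

```python
def glr_step(param, grad, learning_rate):
    # Same update as _compute_update above: param - lr * grad, with
    # learning_rate a single learned scalar shared by all parameters.
    return param - learning_rate * grad
```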
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A trainable optimizer that learns a learning rate schedule."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from learned_optimizer.optimizer import trainable_optimizer
class LearningRateSchedule(trainable_optimizer.TrainableOptimizer):
"""Learns a learning rate schedule over a fixed number of iterations."""
def __init__(self, initial_rate=0.0, n_steps=1000, **kwargs):
"""Initializes the learning rates."""
self.max_index = tf.constant(n_steps-1, dtype=tf.int32)
with tf.variable_scope(trainable_optimizer.OPTIMIZER_SCOPE):
initializer = tf.constant_initializer(initial_rate)
self.learning_rates = tf.get_variable("learning_rates",
shape=([n_steps,]),
initializer=initializer)
super(LearningRateSchedule, self).__init__("LRS", ["itr"], **kwargs)
def _initialize_state(self, var):
"""Return a dictionary mapping names of state variables to their values."""
return {
"itr": tf.constant(0, dtype=tf.int32),
}
def _compute_update(self, param, grad, state):
"""Compute updates of parameters."""
    # Get the learning rate at the current index; if the index is greater
    # than the number of available learning rates, use the last one.
index = tf.minimum(state["itr"], self.max_index)
learning_rate = tf.gather(self.learning_rates, index)
# update the parameters: parameter - learning_rate * gradient
updated_param = param - tf.scalar_mul(learning_rate, grad)
return updated_param, {"itr": state["itr"] + 1}
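The schedule lookup clamps the iteration counter, so training runs longer than `n_steps` simply reuse the last learned rate. A sketch with an illustrative three-step schedule:

```python
import numpy as np

learning_rates = np.array([0.1, 0.05, 0.01])  # illustrative learned schedule
max_index = len(learning_rates) - 1

# Past the end of the schedule, min(itr, max_index) pins to the last rate.
rates_used = [float(learning_rates[min(itr, max_index)]) for itr in [0, 2, 5]]
print(rates_used)  # [0.1, 0.01, 0.01]
```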
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Custom RNN cells for hierarchical RNNs."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from learned_optimizer.optimizer import utils
class BiasGRUCell(tf.contrib.rnn.RNNCell):
"""GRU cell (cf. http://arxiv.org/abs/1406.1078) with an additional bias."""
def __init__(self, num_units, activation=tf.tanh, scale=0.1,
gate_bias_init=0., random_seed=None):
self._num_units = num_units
self._activation = activation
self._scale = scale
self._gate_bias_init = gate_bias_init
self._random_seed = random_seed
@property
def state_size(self):
return self._num_units
@property
def output_size(self):
return self._num_units
def __call__(self, inputs, state, bias=None):
# Split the injected bias vector into a bias for the r, u, and c updates.
if bias is None:
bias = tf.zeros((1, 3))
r_bias, u_bias, c_bias = tf.split(bias, 3, 1)
with tf.variable_scope(type(self).__name__): # "BiasGRUCell"
with tf.variable_scope("gates"): # Reset gate and update gate.
proj = utils.affine([inputs, state], 2 * self._num_units,
scale=self._scale, bias_init=self._gate_bias_init,
random_seed=self._random_seed)
r_lin, u_lin = tf.split(proj, 2, 1)
r, u = tf.nn.sigmoid(r_lin + r_bias), tf.nn.sigmoid(u_lin + u_bias)
with tf.variable_scope("candidate"):
proj = utils.affine([inputs, r * state], self._num_units,
scale=self._scale, random_seed=self._random_seed)
c = self._activation(proj + c_bias)
new_h = u * state + (1 - u) * c
return new_h, new_h
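The cell's equations, written out as a toy NumPy step (random weights, illustrative sizes; the real cell builds its projections with `utils.affine`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# BiasGRUCell equations with an injected per-gate bias (here all zero,
# matching the bias=None case):
#   r = sigmoid(W_r [x, h] + r_bias), u = sigmoid(W_u [x, h] + u_bias)
#   c = tanh(W_c [x, r * h] + c_bias), h' = u * h + (1 - u) * c
rng = np.random.RandomState(0)
num_units, in_dim = 3, 2
x, h = rng.randn(1, in_dim), rng.randn(1, num_units)
W_r = rng.randn(in_dim + num_units, num_units) * 0.1
W_u = rng.randn(in_dim + num_units, num_units) * 0.1
W_c = rng.randn(in_dim + num_units, num_units) * 0.1
r_bias = u_bias = c_bias = 0.0

r = sigmoid(np.concatenate([x, h], 1) @ W_r + r_bias)
u = sigmoid(np.concatenate([x, h], 1) @ W_u + u_bias)
c = np.tanh(np.concatenate([x, r * h], 1) @ W_c + c_bias)
new_h = u * h + (1 - u) * c
print(new_h.shape)  # (1, 3)
```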
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A trainable ADAM optimizer that learns its internal variables."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
from learned_optimizer.optimizer import trainable_optimizer as opt
from learned_optimizer.optimizer import utils
class TrainableAdam(opt.TrainableOptimizer):
"""Adam optimizer with learnable scalar parameters.
  See Kingma et al., 2014 for the algorithm (http://arxiv.org/abs/1412.6980).
"""
def __init__(self,
learning_rate=1e-3,
beta1=0.9,
beta2=0.999,
epsilon=1e-8,
**kwargs):
"""Initializes the TrainableAdam optimizer with the given initial values.
Args:
learning_rate: The learning rate (default: 1e-3).
beta1: The exponential decay rate for the 1st moment estimates.
beta2: The exponential decay rate for the 2nd moment estimates.
epsilon: A small constant for numerical stability.
**kwargs: Any additional keyword arguments for TrainableOptimizer.
Raises:
ValueError: if the learning rate or epsilon is not positive
ValueError: if beta1 or beta2 is not in (0, 1).
"""
if learning_rate <= 0:
raise ValueError("Learning rate must be positive.")
if epsilon <= 0:
raise ValueError("Epsilon must be positive.")
if not 0 < beta1 < 1 or not 0 < beta2 < 1:
raise ValueError("Beta values must be between 0 and 1, exclusive.")
self._reuse_vars = False
with tf.variable_scope(opt.OPTIMIZER_SCOPE):
def inv_sigmoid(x):
return np.log(x / (1.0 - x))
self.log_learning_rate = tf.get_variable(
"log_learning_rate",
shape=[],
initializer=tf.constant_initializer(np.log(learning_rate)))
self.beta1_logit = tf.get_variable(
"beta1_logit",
shape=[],
initializer=tf.constant_initializer(inv_sigmoid(beta1)))
self.beta2_logit = tf.get_variable(
"beta2_logit",
shape=[],
initializer=tf.constant_initializer(inv_sigmoid(beta2)))
self.log_epsilon = tf.get_variable(
"log_epsilon",
shape=[],
initializer=tf.constant_initializer(np.log(epsilon)))
# Key names are derived from Algorithm 1 described in
# https://arxiv.org/pdf/1412.6980.pdf
state_keys = ["m", "v", "t"]
super(TrainableAdam, self).__init__("Adam", state_keys, **kwargs)
def _initialize_state(self, var):
"""Returns a dictionary mapping names of state variables to their values."""
vectorized_shape = var.get_shape().num_elements(), 1
return {key: tf.zeros(vectorized_shape) for key in self.state_keys}
def _compute_update(self, param, grad, state):
"""Calculates the new internal state and parameters.
If the gradient is sparse, updates the appropriate slices in the internal
state and stacks the update tensor.
Args:
param: A tensor of parameters.
grad: A tensor of gradients with the same shape as param.
state: A dictionary containing any state for the optimizer.
Returns:
updated_param: The updated parameters.
updated_state: The updated state variables in a dictionary.
"""
with tf.variable_scope(opt.OPTIMIZER_SCOPE) as scope:
if self._reuse_vars:
scope.reuse_variables()
else:
self._reuse_vars = True
(grad_values, first_moment, second_moment, timestep, grad_indices
) = self._extract_gradients_and_internal_state(
grad, state, tf.shape(param))
beta1 = tf.nn.sigmoid(self.beta1_logit)
beta2 = tf.nn.sigmoid(self.beta2_logit)
epsilon = tf.exp(self.log_epsilon) + 1e-10
learning_rate = tf.exp(self.log_learning_rate)
old_grad_shape = tf.shape(grad_values)
grad_values = tf.reshape(grad_values, [-1, 1])
new_timestep = timestep + 1
new_first_moment = self._update_adam_estimate(
first_moment, grad_values, beta1)
      new_second_moment = self._update_adam_estimate(
          second_moment, tf.square(grad_values), beta2)
debiased_first_moment = self._debias_adam_estimate(
new_first_moment, beta1, new_timestep)
debiased_second_moment = self._debias_adam_estimate(
new_second_moment, beta2, new_timestep)
# Propagating through the square root of 0 is very bad for stability.
update = (learning_rate * debiased_first_moment /
(tf.sqrt(debiased_second_moment + 1e-10) + epsilon))
update = tf.reshape(update, old_grad_shape)
if grad_indices is not None:
param_shape = tf.shape(param)
update = utils.stack_tensor(
update, grad_indices, param, param_shape[:1])
new_first_moment = utils.update_slices(
new_first_moment, grad_indices, state["m"], param_shape)
new_second_moment = utils.update_slices(
new_second_moment, grad_indices, state["v"], param_shape)
new_timestep = utils.update_slices(
new_timestep, grad_indices, state["t"], param_shape)
new_param = param - update
# collect the update and new state
new_state = {
"m": new_first_moment,
"v": new_second_moment,
"t": new_timestep
}
return new_param, new_state
def _update_adam_estimate(self, estimate, value, beta):
"""Returns a beta-weighted average of estimate and value."""
return (beta * estimate) + ((1 - beta) * value)
def _debias_adam_estimate(self, estimate, beta, t_step):
"""Returns a debiased estimate based on beta and the timestep."""
return estimate / (1 - tf.pow(beta, t_step))
def _extract_gradients_and_internal_state(self, grad, state, param_shape):
"""Extracts the gradients and relevant internal state.
If the gradient is sparse, extracts the appropriate slices from the state.
Args:
grad: The current gradient.
state: The current state.
param_shape: The shape of the parameter (used if gradient is sparse).
Returns:
grad_values: The gradient value tensor.
first_moment: The first moment tensor (internal state).
second_moment: The second moment tensor (internal state).
timestep: The current timestep (internal state).
grad_indices: The indices for the gradient tensor, if sparse.
None otherwise.
"""
grad_values = grad
grad_indices = None
first_moment = state["m"]
second_moment = state["v"]
timestep = state["t"]
if isinstance(grad, tf.IndexedSlices):
grad_indices, grad_values = utils.accumulate_sparse_gradients(grad)
first_moment = utils.slice_tensor(
first_moment, grad_indices, param_shape)
second_moment = utils.slice_tensor(
second_moment, grad_indices, param_shape)
timestep = utils.slice_tensor(timestep, grad_indices, param_shape)
return grad_values, first_moment, second_moment, timestep, grad_indices
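One dense Adam step in NumPy, mirroring the update form used above (including the `sqrt(v_hat + 1e-10) + epsilon` denominator); the parameter values are the defaults, and the gradient is illustrative:

```python
import numpy as np

learning_rate, beta1, beta2, epsilon = 1e-3, 0.9, 0.999, 1e-8
grad = np.array([0.1, -0.2])
m, v, t = np.zeros(2), np.zeros(2), 1

# Moment updates and debiasing, as in Kingma & Ba, Algorithm 1.
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
update = learning_rate * m_hat / (np.sqrt(v_hat + 1e-10) + epsilon)

print(np.sign(update))  # [ 1. -1.]: the step follows the gradient sign
```

On the first step the debiased moments reduce to `grad` and `grad ** 2`, so each coordinate moves by roughly the learning rate.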
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Utilities and helper functions."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
def make_finite(t, replacement):
"""Replaces non-finite tensor values with the replacement value."""
return tf.where(tf.is_finite(t), t, replacement)
def asinh(x):
"""Computes the inverse hyperbolic sine function (in tensorflow)."""
return tf.log(x + tf.sqrt(1. + x ** 2))
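The `asinh` helper relies on the identity `arcsinh(x) = log(x + sqrt(1 + x^2))`; a quick NumPy check against the reference implementation:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 11)
ours = np.log(x + np.sqrt(1.0 + x ** 2))
print(np.allclose(ours, np.arcsinh(x)))  # True
```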
def affine(inputs, output_size, scope="Affine", scale=0.1, vec_mean=0.,
include_bias=True, bias_init=0., random_seed=None):
"""Computes an affine function of the inputs.
Creates or recalls tensorflow variables "Matrix" and "Bias"
to generate an affine operation on the input.
If the inputs are a list of tensors, they are concatenated together.
Initial weights for the matrix are drawn from a Gaussian with zero
mean and standard deviation that is the given scale divided by the
square root of the input dimension. Initial weights for the bias are
set to zero.
Args:
inputs: List of tensors with shape (batch_size, input_size)
output_size: Size (dimension) of the output
scope: Variable scope for these parameters (default: "Affine")
scale: Initial weight scale for the matrix parameters (default: 0.1),
this constant is divided by the sqrt of the input size to get the
std. deviation of the initial weights
vec_mean: The mean for the random initializer
include_bias: Whether to include the bias term
bias_init: The initializer bias (default 0.)
random_seed: Random seed for random initializers. (Default: None)
Returns:
output: Tensor with shape (batch_size, output_size)
"""
# Concatenate the input arguments.
x = tf.concat(inputs, 1)
with tf.variable_scope(scope):
input_size = x.get_shape().as_list()[1]
sigma = scale / np.sqrt(input_size)
rand_init = tf.random_normal_initializer(mean=vec_mean, stddev=sigma,
seed=random_seed)
matrix = tf.get_variable("Matrix", [input_size, output_size],
dtype=tf.float32, initializer=rand_init)
if include_bias:
bias = tf.get_variable("Bias", [output_size], dtype=tf.float32,
initializer=tf.constant_initializer(bias_init,
tf.float32))
else:
bias = 0.
output = tf.matmul(x, matrix) + bias
return output
def project(inputs, weights, bias=0., activation=tf.identity):
"""Computes an affine or linear projection of the inputs.
Projects the inputs onto the given weight vector and (optionally)
adds a bias and passes the result through an activation function.
Args:
inputs: matrix of inputs with shape [batch_size, dim]
weights: weight matrix with shape [dim, output_dim]
bias: bias vector with shape [output_dim] (default: 0)
activation: nonlinear activation function (default: tf.identity)
Returns:
outputs: an op which computes activation(inputs @ weights + bias)
"""
return activation(tf.matmul(inputs, weights) + bias)
def new_mean_squared(grad_vec, decay, ms):
"""Calculates the new accumulated mean squared of the gradient.
Args:
grad_vec: the vector for the current gradient
decay: the decay term
ms: the previous mean_squared value
Returns:
the new mean_squared value
"""
decay_size = decay.get_shape().num_elements()
decay_check_ops = [
tf.assert_less_equal(decay, 1., summarize=decay_size),
tf.assert_greater_equal(decay, 0., summarize=decay_size)]
with tf.control_dependencies(decay_check_ops):
grad_squared = tf.square(grad_vec)
# If the previous mean_squared is the 0 vector, don't use the decay and just
# return the full grad_squared. This should only happen on the first timestep.
decay = tf.cond(tf.reduce_all(tf.equal(ms, 0.)),
lambda: tf.zeros_like(decay, dtype=tf.float32), lambda: decay)
# Update the running average of squared gradients.
epsilon = 1e-12
return (1. - decay) * (grad_squared + epsilon) + decay * ms
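A NumPy sketch of the same running-average logic, including the first-step special case where a zero `ms` skips the decay (the eager `if` here stands in for the graph-mode `tf.cond`):

```python
import numpy as np

def new_ms(grad, decay, ms, eps=1e-12):
    if np.all(ms == 0.0):
        decay = 0.0                      # first step: take the full grad^2
    return (1.0 - decay) * (grad ** 2 + eps) + decay * ms

grad = np.array([1.0, 2.0])
ms = new_ms(grad, 0.9, np.zeros(2))      # ~= grad^2 on the first step
ms2 = new_ms(grad, 0.9, ms)              # later steps blend with decay 0.9
print(ms, ms2)
```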
def rms_scaling(gradient, decay, ms, update_ms=True):
"""Vectorizes and scales a tensor of gradients.
Args:
gradient: the current gradient
decay: the current decay value.
ms: the previous mean squared value
update_ms: Whether to update the mean squared value (default: True)
Returns:
The scaled gradient and the new ms value if update_ms is True,
the old ms value otherwise.
"""
# Vectorize the gradients and compute the squared gradients.
grad_vec = tf.reshape(gradient, [-1, 1])
if update_ms:
ms = new_mean_squared(grad_vec, decay, ms)
# Scale the current gradients by the RMS, squashed by the asinh function.
scaled_gradient = asinh(grad_vec / tf.sqrt(ms + 1e-16))
return scaled_gradient, ms
def accumulate_sparse_gradients(grad):
"""Accumulates repeated indices of a sparse gradient update.
Args:
grad: a tf.IndexedSlices gradient
Returns:
grad_indices: unique indices
grad_values: gradient values corresponding to the indices
"""
grad_indices, grad_segments = tf.unique(grad.indices)
grad_values = tf.unsorted_segment_sum(grad.values, grad_segments,
tf.shape(grad_indices)[0])
return grad_indices, grad_values
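The accumulation step, sketched in NumPy: values sharing an index are summed, mirroring `tf.unique` plus `tf.unsorted_segment_sum` (note `np.unique` sorts the indices, whereas `tf.unique` preserves encounter order):

```python
import numpy as np

indices = np.array([2, 0, 2])
values = np.array([1.0, 5.0, 3.0])

# Sum values that share an index.
unique, segments = np.unique(indices, return_inverse=True)
summed = np.zeros(len(unique))
np.add.at(summed, segments, values)
print(unique, summed)  # [0 2] [5. 4.]
```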
def slice_tensor(dense_tensor, indices, head_dims):
"""Extracts slices from a partially flattened dense tensor.
indices is assumed to index into the first dimension of head_dims.
dense_tensor is assumed to have a shape [D_0, D_1, ...] such that
prod(head_dims) == D_0. This function will extract slices along the
first_dimension of head_dims.
Example:
Consider a tensor with shape head_dims = [100, 2] and a dense_tensor with
shape [200, 3]. Note that the first dimension of dense_tensor equals the
product of head_dims. This function will reshape dense_tensor such that
its shape is now [100, 2, 3] (i.e. the first dimension became head-dims)
and then slice it along the first dimension. After slicing, the slices will
have their initial dimensions flattened just as they were in dense_tensor
(e.g. if there are 4 indices, the return value will have a shape of [4, 3]).
Args:
dense_tensor: a N-D dense tensor. Shape: [D_0, D_1, ...]
indices: a 1-D integer tensor. Shape: [K]
head_dims: True dimensions of the dense_tensor's first dimension.
Returns:
Extracted slices. Shape [K, D_1, ...]
"""
tail_dims = tf.shape(dense_tensor)[1:]
dense_tensor = tf.reshape(dense_tensor,
tf.concat([head_dims, tail_dims], 0))
slices = tf.gather(dense_tensor, indices)
# NOTE(siege): This kills the shape annotation.
return tf.reshape(slices, tf.concat([[-1], tail_dims], 0))
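The docstring example above, reproduced in NumPy: a `[200, 3]` tensor with `head_dims = [100, 2]` is viewed as `[100, 2, 3]`, gathered along the first dimension, then re-flattened:

```python
import numpy as np

dense = np.arange(200 * 3, dtype=np.float32).reshape(200, 3)
head_dims = (100, 2)
indices = np.array([0, 7, 42, 99])

viewed = dense.reshape(head_dims + (3,))   # [100, 2, 3]
slices = viewed[indices]                   # [4, 2, 3], like tf.gather
flat = slices.reshape(-1, 3)               # each index contributes 2 rows
print(flat.shape)  # (8, 3)
```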
def stack_tensor(slices, indices, dense_tensor, head_dims):
  """Reconstitutes a tensor from slices and corresponding indices.
This is an inverse operation to slice_tensor. Missing slices are set to 0.
Args:
slices: a tensor. Shape [K, D_1, ...]
indices: a 1-D integer tensor. Shape: [K]
dense_tensor: the original tensor the slices were taken
from. Shape: [D_0, D_1, ...]
head_dims: True dimensions of the dense_tensor's first dimension.
Returns:
    Reconstituted tensor. Shape: [D_0, D_1, ...]
"""
# NOTE(siege): This cast shouldn't be necessary.
indices = tf.cast(indices, tf.int32)
tail_dims = tf.shape(dense_tensor)[1:]
dense_shape = tf.concat([head_dims, tail_dims], 0)
slices = tf.reshape(slices, tf.concat([[-1], dense_shape[1:]], 0))
indices = tf.expand_dims(indices, -1)
return tf.reshape(tf.scatter_nd(indices, slices, dense_shape),
tf.shape(dense_tensor))
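For the simple case where `head_dims` matches the tensor's first dimension, `stack_tensor` behaves like a scatter into zeros; a NumPy sketch:

```python
import numpy as np

dense = np.ones((4, 2))        # only its shape matters for the result
slices = np.array([[5.0, 5.0], [7.0, 7.0]])
indices = np.array([1, 3])

# Scatter the slices to their indices; missing rows stay 0.
result = np.zeros_like(dense)
result[indices] = slices
print(result)
```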
def update_slices(slices, indices, dense_tensor, head_dims):
"""Reconstitutes a tensor from slices and corresponding indices.
  Like stack_tensor, but instead of setting missing slices to 0, sets them to
what they were in the original tensor. The return value is reshaped to be
the same as dense_tensor.
Args:
slices: a tensor. Shape [K, D_1, ...]
indices: a 1-D integer tensor. Shape: [K]
dense_tensor: the original tensor the slices were taken
from. Shape: [D_0, D_1, ...]
head_dims: True dimensions of the dense_tensor's first dimension.
Returns:
    Reconstituted tensor. Shape: [D_0, D_1, ...]
"""
# NOTE(siege): This cast shouldn't be necessary.
indices = tf.cast(indices, tf.int32)
tail_dims = tf.shape(dense_tensor)[1:]
dense_shape = tf.concat([head_dims, tail_dims], 0)
update_mask_vals = tf.fill(tf.shape(indices), 1)
reshaped_indices = tf.expand_dims(indices, -1)
update_mask = tf.equal(
tf.scatter_nd(reshaped_indices, update_mask_vals, head_dims[:1]), 1)
reshaped_dense_slices = tf.reshape(
stack_tensor(slices, indices, dense_tensor, head_dims), dense_shape)
reshaped_dense_tensor = tf.reshape(dense_tensor, dense_shape)
return tf.reshape(
tf.where(update_mask, reshaped_dense_slices, reshaped_dense_tensor),
tf.shape(dense_tensor))
package(default_visibility = ["//visibility:public"])
# Libraries
# =========
py_library(
name = "datasets",
srcs = ["datasets.py"],
deps = [
],
)
py_library(
name = "model_adapter",
srcs = ["model_adapter.py"],
deps = [
":problem_generator",
],
)
py_library(
name = "problem_generator",
srcs = ["problem_generator.py"],
deps = [
":problem_spec",
],
)
py_library(
name = "problem_sets",
srcs = ["problem_sets.py"],
deps = [
":datasets",
":model_adapter",
":problem_generator",
],
)
py_library(
name = "problem_spec",
srcs = ["problem_spec.py"],
deps = [],
)
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Functions to generate or load datasets for supervised learning."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from collections import namedtuple
import numpy as np
from sklearn.datasets import make_classification
MAX_SEED = 4294967295
class Dataset(namedtuple("Dataset", "data labels")):
"""Helper class for managing a supervised learning dataset.
Args:
data: an array of type float32 with N samples, each of which is the set
of features for that sample. (Shape (N, D_i), where N is the number of
samples and D_i is the number of features for that sample.)
labels: an array of type int32 or int64 with N elements, indicating the
class label for the corresponding set of features in data.
"""
# Since this is an immutable object, we don't need to reserve slots.
__slots__ = ()
@property
def size(self):
"""Dataset size (number of samples)."""
return len(self.data)
def batch_indices(self, num_batches, batch_size):
"""Creates indices of shuffled minibatches.
Args:
num_batches: the number of batches to generate
batch_size: the size of each batch
Returns:
batch_indices: a list of minibatch indices, arranged so that the dataset
is randomly shuffled.
Raises:
ValueError: if the data and labels have different lengths
"""
if len(self.data) != len(self.labels):
raise ValueError("Labels and data must have the same number of samples.")
batch_indices = []
# Follows logic in mnist.py to ensure we cover the entire dataset.
index_in_epoch = 0
dataset_size = len(self.data)
dataset_indices = np.arange(dataset_size)
np.random.shuffle(dataset_indices)
for _ in range(num_batches):
start = index_in_epoch
index_in_epoch += batch_size
if index_in_epoch > dataset_size:
# Finished epoch, reshuffle.
np.random.shuffle(dataset_indices)
# Start next epoch.
start = 0
index_in_epoch = batch_size
end = index_in_epoch
batch_indices.append(dataset_indices[start:end].tolist())
return batch_indices
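The epoch-aware batching above, sketched with small illustrative numbers: when a batch would run past the end of the dataset, the indices are reshuffled and a new epoch begins, so every batch is full-size:

```python
import numpy as np

dataset_size, batch_size, num_batches = 5, 2, 4
index_in_epoch = 0
order = np.arange(dataset_size)
np.random.shuffle(order)

batches = []
for _ in range(num_batches):
    start = index_in_epoch
    index_in_epoch += batch_size
    if index_in_epoch > dataset_size:
        np.random.shuffle(order)       # finished an epoch; reshuffle
        start, index_in_epoch = 0, batch_size
    batches.append(order[start:index_in_epoch].tolist())

print([len(b) for b in batches])  # [2, 2, 2, 2]
```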
def noisy_parity_class(n_samples,
n_classes=2,
n_context_ids=5,
noise_prob=0.25,
random_seed=None):
"""Returns a randomly generated sparse-to-sparse dataset.
The label is a parity class of a set of context classes.
Args:
n_samples: number of samples (data points)
n_classes: number of class labels (default: 2)
n_context_ids: how many classes to take the parity of (default: 5).
noise_prob: how often to corrupt the label (default: 0.25)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
np.random.seed(random_seed)
x = np.random.randint(0, n_classes, [n_samples, n_context_ids])
noise = np.random.binomial(1, noise_prob, [n_samples])
y = (np.sum(x, 1) + noise) % n_classes
return Dataset(x.astype("float32"), y.astype("int32"))
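The parity labelling, shown without the noise term: the label is the sum of the context ids modulo the number of classes (hand-picked rows for illustration):

```python
import numpy as np

n_classes = 2
x = np.array([[0, 1, 1, 0, 1],
              [1, 1, 1, 1, 1]])
y = np.sum(x, 1) % n_classes   # parity of each row's context ids
print(y)  # [1 1]
```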
def random(n_features, n_samples, n_classes=2, sep=1.0, random_seed=None):
"""Returns a randomly generated classification dataset.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
n_classes: number of class labels (default: 2)
sep: separation of the two classes, a higher value corresponds to
an easier classification problem (default: 1.0)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
# Generate the problem data.
x, y = make_classification(n_samples=n_samples,
n_features=n_features,
n_informative=n_features,
n_redundant=0,
n_classes=n_classes,
class_sep=sep,
random_state=random_seed)
return Dataset(x.astype("float32"), y.astype("int32"))
def random_binary(n_features, n_samples, random_seed=None):
"""Returns a randomly generated dataset of binary values.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
random_seed = (np.random.randint(MAX_SEED) if random_seed is None
else random_seed)
np.random.seed(random_seed)
x = np.random.randint(2, size=(n_samples, n_features))
y = np.zeros((n_samples, 1))
return Dataset(x.astype("float32"), y.astype("int32"))
def random_symmetric(n_features, n_samples, random_seed=None):
"""Returns a randomly generated dataset of values and their negatives.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
random_seed: seed used for drawing the random data (default: None)
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
random_seed = (np.random.randint(MAX_SEED) if random_seed is None
else random_seed)
np.random.seed(random_seed)
x1 = np.random.normal(size=(int(n_samples/2), n_features))
x = np.concatenate((x1, -x1), axis=0)
y = np.zeros((n_samples, 1))
return Dataset(x.astype("float32"), y.astype("int32"))
def random_mlp(n_features, n_samples, random_seed=None, n_layers=6, width=20):
"""Returns a generated output of an MLP with random weights.
Args:
n_features: number of features (dependent variables)
n_samples: number of samples (data points)
random_seed: seed used for drawing the random data (default: None)
n_layers: number of layers in random MLP
width: width of the layers in random MLP
Returns:
dataset: A Dataset namedtuple containing the generated data and labels
"""
random_seed = (np.random.randint(MAX_SEED) if random_seed is None
else random_seed)
np.random.seed(random_seed)
x = np.random.normal(size=(n_samples, n_features))
y = x
n_in = n_features
scale_factor = np.sqrt(2.) / np.sqrt(n_features)
for _ in range(n_layers):
weights = np.random.normal(size=(n_in, width)) * scale_factor
y = np.dot(y, weights).clip(min=0)
n_in = width
y = y[:, 0]
y[y > 0] = 1
return Dataset(x.astype("float32"), y.astype("int32"))
EMPTY_DATASET = Dataset(np.array([], dtype="float32"),
np.array([], dtype="int32"))
# Copyright 2017 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of the ModelAdapter class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import mock
import tensorflow as tf
from learned_optimizer.problems import problem_generator as pg
class ModelAdapter(pg.Problem):
"""Adapts Tensorflow models/graphs into a form suitable for meta-training.
This class adapts an existing TensorFlow graph into a form suitable for
meta-training a learned optimizer.
"""
def __init__(self, make_loss_and_init_fn):
"""Wraps a model in the Problem interface.
    The make_loss_and_init_fn argument is a callable that returns a tuple of
    two other callables as follows.
    The first will construct most of the graph and return the problem loss. It
    is essential that this graph contains the totality of the model's variables,
    but none of its queues.
    The second will construct the model initialization graph given a list
    of parameters and return a callable that is passed an instance of
    tf.Session, and should initialize the model's parameters.
    A valid argument would look like this:
```python
def make_loss_and_init_fn():
inputs = queued_reader()
def make_loss():
return create_model_with_variables(inputs)
def make_init_fn(parameters):
saver = tf.Saver(parameters)
def init_fn(sess):
            saver.restore(sess, ...)
return init_fn
return make_loss, make_init_fn
```
Args:
      make_loss_and_init_fn: a callable, as described above
"""
make_loss_fn, make_init_fn = make_loss_and_init_fn()
self.make_loss_fn = make_loss_fn
self.parameters, self.constants = _get_variables(make_loss_fn)
if make_init_fn is not None:
init_fn = make_init_fn(self.parameters + self.constants)
else:
init_op = tf.initialize_variables(self.parameters + self.constants)
init_fn = lambda sess: sess.run(init_op)
tf.logging.info("ModelAdapter parameters: %s",
[op.name for op in self.parameters])
tf.logging.info("ModelAdapter constants: %s",
[op.name for op in self.constants])
super(ModelAdapter, self).__init__(
[], random_seed=None, noise_stdev=0.0, init_fn=init_fn)
def init_tensors(self, seed=None):
"""Returns a list of tensors with the given shape."""
return self.parameters
def init_variables(self, seed=None):
"""Returns a list of variables with the given shape."""
# NOTE(siege): This is awkward, as these are not set as trainable.
return self.parameters
def objective(self, parameters, data=None, labels=None):
"""Computes the objective given a list of parameters.
Args:
parameters: The parameters to optimize (as a list of tensors)
data: An optional batch of data for calculating objectives
labels: An optional batch of corresponding labels
Returns:
A scalar tensor representing the objective value
"""
# We need to set up a mapping based on the original parameter names, because
# the parameters passed can be arbitrary tensors.
parameter_mapping = {
old_p.name: p
for old_p, p in zip(self.parameters, parameters)
}
with tf.variable_scope(tf.get_variable_scope(), reuse=True):
return _make_with_custom_variables(self.make_loss_fn, parameter_mapping)
def _get_variables(func):
"""Calls func, returning any variables created.
The created variables are modified to not be trainable, and are placed into
the LOCAL_VARIABLES collection.
Args:
func: Function to be called.
Returns:
A tuple (variables, constants) where the first element is a list of
trainable variables and the second is the non-trainable variables.
"""
variables = []
constants = []
# We need to create these variables like normal, so grab the original
# constructor before we mock it.
original_init = tf.Variable.__init__
def custom_init(self, *args, **kwargs):
trainable = kwargs["trainable"]
kwargs["trainable"] = False
# Making these variables local keeps them out of the optimizer's checkpoints
# somehow.
kwargs["collections"] = [tf.GraphKeys.LOCAL_VARIABLES]
original_init(self, *args, **kwargs)
if trainable:
variables.append(self)
else:
constants.append(self)
# This name-scope is just a nicety for TensorBoard.
with tf.name_scope("unused_graph"):
with mock.patch.object(tf.Variable, "__init__", custom_init):
func()
return variables, constants
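The variable-capture trick used by `_get_variables` can be sketched without TensorFlow: patch a class's `__init__` to record every instance created inside a function call (the `Var` class here is a stand-in for `tf.Variable`):

```python
from unittest import mock

class Var:
    def __init__(self, value, trainable=True):
        self.value, self.trainable = value, trainable

captured = []
original_init = Var.__init__   # grab the real constructor before patching

def recording_init(self, *args, **kwargs):
    original_init(self, *args, **kwargs)
    captured.append(self)

def build():
    Var(1.0)
    Var(2.0, trainable=False)

# While patched, every Var created inside build() is recorded.
with mock.patch.object(Var, "__init__", recording_init):
    build()

print(len(captured), [v.trainable for v in captured])  # 2 [True, False]
```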
def _make_with_custom_variables(func, variable_mapping):
"""Calls func and replaces the value of some variables created in it.
Args:
func: Function to be called.
variable_mapping: A mapping of variable name to the replacement tensor or
tf.Variable.
Returns:
The return value of func is returned.
"""
original_value = tf.Variable.value
def custom_value(self):
if self.name in variable_mapping:
replacement = variable_mapping[self.name]
tf.logging.info("Replaced %s with %s" % (self.name, replacement))
      # The value() method needs to return a tensor, so we call the original
      # value() on the replacement manually; going through the patched method
      # would cause an infinite loop.
if isinstance(replacement, tf.Variable):
replacement = original_value(replacement)
return replacement
else:
return original_value(self)
with mock.patch.object(tf.Variable, "value", custom_value):
with mock.patch.object(tf.Variable, "_AsTensor", custom_value):
return func()