Unverified Commit 088e24fb authored by Yanhui Liang, committed by GitHub

Add run individual step only option (#4049)

* Add run individual step only option

* Fix comments and update readme

* Add validation argument

* Address comments

* Make code shorter

* Fix more lints
parent 5be3c064
# MiniGo
This is a simplified implementation of MiniGo based on the code provided by the authors: [MiniGo](https://github.com/tensorflow/minigo).

MiniGo is a minimalist Go engine modeled after AlphaGo Zero, ["Mastering the Game of Go without Human Knowledge"](https://www.nature.com/articles/nature24270). A useful one-diagram overview of AlphaGo Zero can be found in this [cheat sheet](https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0).

This implementation provides model training and validation, as well as evaluation between two Go models. The implementation of MiniGo consists of three main components: the DualNet model, the Monte Carlo Tree Search (MCTS), and Go domain knowledge. Currently, the **DualNet model** is our focus.
## DualNet Architecture
DualNet is the neural network used in MiniGo. It is built from residual blocks with a two-headed output. The following is a brief overview of the DualNet architecture.
### Input Features
The input to the neural network is a [board_size * board_size * 17] image stack
comprising 17 binary feature planes. 8 feature planes consist of binary values
indicating the presence of the current player's stones; a further 8 feature
planes represent the corresponding features for the opponent's stones; the final
feature plane represents the color to play, and has a constant value of either 1
if black is to play or 0 if white is to play. Check [features.py](features.py) for more details.
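
The sketch below is a minimal, illustrative example (not the actual code in [features.py](features.py)) of how such a 17-plane stack could be assembled with NumPy; the `history` and `to_play` representations used here are assumptions made only for illustration.
```
import numpy as np

def make_feature_stack(history, to_play, board_size=9):
  """Builds a [board_size, board_size, 17] stack: 8 planes of the current
  player's stones, 8 planes of the opponent's stones, and 1 color plane."""
  planes = []
  for board in history[-8:]:  # 8 most recent board snapshots (+1 black, -1 white, 0 empty)
    planes.append((board == to_play).astype(np.float32))
  for board in history[-8:]:
    planes.append((board == -to_play).astype(np.float32))
  # Final plane: constant 1 if black (to_play == 1) is to move, else constant 0.
  planes.append(np.full((board_size, board_size), float(to_play == 1), np.float32))
  return np.stack(planes, axis=-1)

# Example: an empty-board history with black to move.
stack = make_feature_stack([np.zeros((9, 9), np.int8)] * 8, to_play=1)
print(stack.shape)  # (9, 9, 17)
```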
### Neural Network Structure
In the MiniGo implementation, the input features are processed by a residual tower
that consists of a single convolutional block followed by either 9 or 19
residual blocks.
...@@ -31,8 +36,9 @@ Each residual block applies the following modules sequentially to its input:
6. A skip connection that adds the input to the block
7. A rectifier non-linearity

Note: num_filter is 128 for 19 x 19 board size, and 32 for 9 x 9 board size in the MiniGo implementation.
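
To make the block structure concrete, here is a minimal TensorFlow 1.x sketch of one residual block, assuming the standard AlphaGo Zero ordering (convolution, batch normalization, ReLU, convolution, batch normalization, skip connection, ReLU). It is illustrative only and not the exact code in the MiniGo model definition; `num_filters` corresponds to the note above.
```
import tensorflow as tf

def residual_block(inputs, num_filters, training):
  """One residual block: conv -> BN -> ReLU -> conv -> BN -> skip -> ReLU."""
  conv1 = tf.layers.conv2d(inputs, num_filters, 3, padding='same', use_bias=False)
  bn1 = tf.layers.batch_normalization(conv1, training=training)
  relu1 = tf.nn.relu(bn1)
  conv2 = tf.layers.conv2d(relu1, num_filters, 3, padding='same', use_bias=False)
  bn2 = tf.layers.batch_normalization(conv2, training=training)
  # Skip connection adds the block input, followed by a final rectifier.
  return tf.nn.relu(bn2 + inputs)
```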
### Dual Heads Output
The output of the residual tower is passed into two separate "heads" for
computing the policy and value respectively. The policy head applies the
following modules:
...@@ -51,7 +57,7 @@ The value head applies the following modules:
6. A fully connected linear layer to a scalar
7. A tanh non-linearity outputting a scalar in the range [-1, 1]
In MiniGo, the overall network depth for the 10- or 20-block network is 19 or 39
parameterized layers respectively for the residual tower, plus an additional 2
layers for the policy head and 3 layers for the value head.
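
As a rough sketch of the two heads (again illustrative, following the AlphaGo Zero head layout rather than quoting the project's exact layer definitions):
```
import tensorflow as tf

def policy_and_value_heads(shared, board_size, fc_width, training):
  """Policy and value heads on top of the residual tower output `shared`."""
  # Policy head: 1x1 conv -> BN -> ReLU -> FC producing board_size^2 + 1 move logits.
  p = tf.layers.conv2d(shared, 2, 1, use_bias=False)
  p = tf.nn.relu(tf.layers.batch_normalization(p, training=training))
  policy_logits = tf.layers.dense(tf.layers.flatten(p), board_size * board_size + 1)

  # Value head: 1x1 conv -> BN -> ReLU -> FC -> ReLU -> FC to a scalar -> tanh in [-1, 1].
  v = tf.layers.conv2d(shared, 1, 1, use_bias=False)
  v = tf.nn.relu(tf.layers.batch_normalization(v, training=training))
  v = tf.nn.relu(tf.layers.dense(tf.layers.flatten(v), fc_width))
  value = tf.nn.tanh(tf.layers.dense(v, 1))
  return policy_logits, value
```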
...@@ -59,56 +65,74 @@ layers for the policy head and 3 layers for the value head.
This project assumes you have virtualenv, TensorFlow (>= 1.5) and two other Go-related
packages: pygtp (>= 0.4) and sgf (== 0.5).
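
For example, the dependencies can typically be installed inside a virtualenv with pip (assuming the standard PyPI package names):
```
pip install "tensorflow>=1.5" "pygtp>=0.4" "sgf==0.5"
```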
## Training Model
One iteration of reinforcement learning (RL) consists of the following steps:

- Bootstrap: initializes a random DualNet model. If the estimator directory already exists, the model is initialized with the last checkpoint.
- Selfplay: plays games with the latest model, or with the best model so far as identified by evaluation, producing data used for training.
- Gather: groups games played with the same model into larger files of tf_examples.
- Train: trains a new model with the selfplay results from the most recent N generations.
To run the RL pipeline, issue the following command:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256
```
Arguments:
* `--base_dir`: Base directory for MiniGo data and models. If not specified, it defaults to /tmp/minigo/.
* `--board_size`: Go board size. It can be either 9 or 19. By default, it is 9.
* `--batch_size`: Batch size for model training. If not specified, it is calculated based on the Go board size.

Use the `--help` or `-h` flag to get a full list of possible arguments. Besides these arguments, other parameters for the RL pipeline and the DualNet model can be found and configured in [model_params.py](model_params.py).
Suppose the base directory argument `base_dir` is `$HOME/minigo/` and we use 9 as the `board_size`. After model training, the following directories are created to store models and game data:

    $HOME/minigo                  # base directory
    ├── 9_size                    # directory for 9x9 board size
    │   │
    │   ├── data
    │   │   ├── holdout           # holdout data for model validation
    │   │   ├── selfplay          # data generated by selfplay of each model
    │   │   └── training_chunks   # gathered tf_examples for model training
    │   │
    │   ├── estimator_model_dir   # estimator working directory
    │   │
    │   ├── trained_models        # all the trained models
    │   │
    │   └── sgf                   # sgf (Smart Game Format) files
    │       ├── 000000-bootstrap  # model name
    │       │   ├── clean         # clean sgf files of model selfplay
    │       │   └── full          # full sgf files of model selfplay
    │       ├── ...
    │       └── evaluate          # clean sgf files of model evaluation
    └── ...
## Validating Model
To validate the trained model, issue the following command with the `--validation` argument:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256 --validation
```
## Evaluating Models
The evaluation step compares the performance of two models. Given two models, one plays black and the other plays white. They play several games (the number of games can be configured by the parameter `eval_games` in [model_params.py](model_params.py)), and the model that wins by a margin of 55% is the winner.

To include the evaluation step in the RL pipeline, specify the `--evaluation` argument to compare the performance of the `current_trained_model` and the `best_model_so_far`. The winner is used to update `best_model_so_far`. Run the following command to include the evaluation step in the pipeline:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256 --evaluation
```
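
The 55% margin can be read as a simple win-rate threshold over the evaluation games. The helper below is only a hypothetical illustration of that reading; the names and logic are assumptions, not the code in `evaluation.play_match`.
```
def pick_winner(black_name, white_name, black_wins, total_games, threshold=0.55):
  """Returns the winning model name, or None if neither side clears the margin."""
  black_rate = black_wins / float(total_games)
  if black_rate >= threshold:
    return black_name
  if 1.0 - black_rate >= threshold:
    return white_name
  return None  # no clear winner; the pipeline could keep the previous best model

print(pick_winner('000001-model', '000002-model', black_wins=30, total_games=50))
# -> 000001-model  (30/50 = 60%, which clears the 55% margin)
```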
## Testing Pipeline
As the whole RL pipeline may take hours to train even for a 9x9 board size, a `--test` argument is provided to test the pipeline quickly with a dummy neural network model.

To test the RL pipeline with a dummy model, issue the following command:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256 --test
```
## Running Self-play Only
A self-play-only option is provided to run the selfplay step individually, so that training data can be generated in parallel. Issue the following command to run selfplay only with the latest trained model:
```
python minigo.py --selfplay
```
Other optional arguments:
* `--selfplay_model_name`: The name of the model used for selfplay only. If not specified, the latest trained model is used for selfplay.
* `--selfplay_max_games`: The maximum number of games selfplay is required to generate. If not specified, the default parameter `max_games_per_generation` is used.
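
For example, to generate a limited number of games from a specific model (here the bootstrap model from the directory layout above, purely as an illustration):
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --selfplay --selfplay_model_name=000000-bootstrap --selfplay_max_games=10
```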
...@@ -191,24 +191,24 @@ def export_model(working_dir, model_path):
    tf.gfile.Copy(filename, destination_path)


def train(working_dir, tf_records, generation, params):
  """Train the model for a specific generation.

  Args:
    working_dir: The model working directory to save model parameters,
      drop logs, checkpoints, and so on.
    tf_records: A list of tf_record filenames for training input.
    generation: The generation to be trained.
    params: hyperparams of the model.

  Raises:
    ValueError: if generation is not greater than 0.
  """
  if generation <= 0:
    raise ValueError('Model 0 is random weights')

  estimator = tf.estimator.Estimator(
      dualnet_model.model_fn, model_dir=working_dir, params=params)
  max_steps = (generation * params.examples_per_generation
               // params.batch_size)
  profiler_hook = tf.train.ProfilerHook(output_dir=working_dir, save_secs=600)
...
...@@ -49,8 +49,7 @@ class GtpInterface(object):
  def set_size(self, n):
    if n != self.board_size:
      raise ValueError((
          "Can't handle boardsize {}! Please check the board size.").format(n))

  def set_komi(self, komi):
    self.komi = komi
...@@ -75,7 +74,7 @@ class GtpInterface(object):
    self.position.flip_playerturn(mutate=True)

  def make_move(self, color, vertex):
    c = coords.from_pygtp(self.board_size, vertex)
    # let's assume this never happens for now.
    # self.accomodate_out_of_turn(color)
    return self.play_move(c)
...@@ -85,7 +84,7 @@ class GtpInterface(object):
    move = self.suggest_move(self.position)
    if self.should_resign():
      return gtp.RESIGN
    return coords.to_pygtp(self.board_size, move)

  def final_score(self):
    return self.position.result_string()
...
...@@ -66,7 +66,7 @@ def bootstrap(estimator_model_dir, trained_models_dir, params):
    estimator_model_dir: tf.estimator model directory.
    trained_models_dir: Dir to save the trained models. Here to export the first
      bootstrapped generation.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  bootstrap_name = utils.generate_model_name(0)
  _ensure_dir_exists(trained_models_dir)
...@@ -79,41 +79,23 @@ def bootstrap(estimator_model_dir, trained_models_dir, params):
  dualnet.export_model(estimator_model_dir, bootstrap_model_path)


def selfplay(selfplay_dirs, selfplay_model, params):
  """Perform selfplay with a specific model.

  Args:
    selfplay_dirs: A dict to specify the directories used in selfplay.
      selfplay_dirs = {
          'output_dir': output_dir,
          'holdout_dir': holdout_dir,
          'clean_sgf': clean_sgf,
          'full_sgf': full_sgf
      }
    selfplay_model: The actual Dualnet runner for selfplay.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  with utils.logged_timer('Playing game'):
    player = selfplay_mcts.play(
        params.board_size, selfplay_model, params.selfplay_readouts,
        params.selfplay_resign_threshold, params.simultaneous_leaves,
        params.selfplay_verbose)
...@@ -124,8 +106,8 @@ def selfplay(model_name, trained_models_dir, selfplay_dir, holdout_dir, sgf_dir,
        os.path.join(dir_sgf, '{}.sgf'.format(output_name)), 'w') as f:
      f.write(player.to_sgf(use_comments=use_comments))

  _write_sgf_data(selfplay_dirs['clean_sgf'], use_comments=False)
  _write_sgf_data(selfplay_dirs['full_sgf'], use_comments=True)

  game_data = player.extract_data()
  tf_examples = preprocessing.make_dataset_from_selfplay(game_data, params)
...@@ -133,10 +115,10 @@ def selfplay(model_name, trained_models_dir, selfplay_dir, holdout_dir, sgf_dir,
  # Hold out 5% of games for evaluation.
  if random.random() < params.holdout_pct:
    fname = os.path.join(
        selfplay_dirs['holdout_dir'], output_name + _TF_RECORD_SUFFIX)
  else:
    fname = os.path.join(
        selfplay_dirs['output_dir'], output_name + _TF_RECORD_SUFFIX)

  preprocessing.write_tf_examples(fname, tf_examples)
...@@ -148,7 +130,7 @@ def gather(selfplay_dir, training_chunk_dir, params):
    selfplay_dir: Where to look for games. Set as 'base_dir/data/selfplay/'.
    training_chunk_dir: where to put collected games. Set as
      'base_dir/data/training_chunks/'.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  # Check the selfplay data from the most recent 50 models.
  _ensure_dir_exists(training_chunk_dir)
...@@ -196,22 +178,22 @@ def gather(selfplay_dir, training_chunk_dir, params):
    f.write('\n'.join(sorted(already_processed)))


def train(trained_models_dir, estimator_model_dir, training_chunk_dir,
          generation, params):
  """Train the latest model from gathered data.

  Args:
    trained_models_dir: Where to export the completed generation.
    estimator_model_dir: tf.estimator model directory.
    training_chunk_dir: Directory where gathered training chunks are.
    generation: Which generation you are training.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  new_model_name = utils.generate_model_name(generation)
  print('New model will be {}'.format(new_model_name))
  new_model = os.path.join(trained_models_dir, new_model_name)

  tf_records = sorted(
      tf.gfile.Glob(os.path.join(training_chunk_dir, '*'+_TF_RECORD_SUFFIX)))
  tf_records = tf_records[
...@@ -219,8 +201,8 @@ def train(trained_models_dir, estimator_model_dir, training_chunk_dir, params):
  print('Training from: {} to {}'.format(tf_records[0], tf_records[-1]))
  with utils.logged_timer('Training'):
    dualnet.train(estimator_model_dir, tf_records, generation, params)
  dualnet.export_model(estimator_model_dir, new_model)


def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
...@@ -230,7 +212,7 @@ def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
    trained_models_dir: Directories where the completed generations/models are.
    holdout_dir: Directories where holdout data are.
    estimator_model_dir: tf.estimator model directory.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  model_num, _ = utils.get_latest_model(trained_models_dir)
...@@ -251,6 +233,11 @@ def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
    tf_records.extend(
        tf.gfile.Glob(os.path.join(record_dir, '*'+_TF_RECORD_SUFFIX)))

  if not tf_records:
    print('No holdout dataset for validation! '
          'Please check your holdout directory: {}'.format(holdout_dir))
    return

  print('The length of tf_records is {}.'.format(len(tf_records)))
  first_tf_record = os.path.basename(tf_records[0])
  last_tf_record = os.path.basename(tf_records[-1])
...@@ -259,21 +246,22 @@ def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
  dualnet.validate(estimator_model_dir, tf_records, params)
def evaluate(black_model_name, black_net, white_model_name, white_net,
             evaluate_dir, params):
  """Evaluate with two models.

  Two DualNetRunners play as black and white in a Go match. The two models
  play several games, and the model that wins by a margin of 55% will be the
  winner.

  Args:
    black_model_name: The name of the model playing black.
    black_net: The DualNetRunner model for black.
    white_model_name: The name of the model playing white.
    white_net: The DualNetRunner model for white.
    evaluate_dir: Where to write the evaluation results. Set as
      'base_dir/sgf/evaluate/'.
    params: A MiniGoParams instance of hyperparameters for the model.

  Returns:
    The model name of the winner.
...@@ -281,19 +269,6 @@ def evaluate(trained_models_dir, black_model_name, white_model_name,
  Raises:
    ValueError: if neither `WHITE` nor `BLACK` is returned.
  """
  with utils.logged_timer('{} games'.format(params.eval_games)):
    winner = evaluation.play_match(
        params, black_net, white_net, params.eval_games,
...@@ -305,38 +280,122 @@ def evaluate(trained_models_dir, black_model_name, white_model_name,
  return black_model_name if winner == go.BLACK_NAME else white_model_name
def _set_params(flags):
  """Set hyperparameters from the board size.

  Args:
    flags: Flags from Argparser.

  Returns:
    A MiniGoParams instance of hyperparameters.
  """
  params = model_params.MiniGoParams()
  k = utils.round_power_of_two(flags.board_size ** 2 / 3)
  params.num_filters = k  # Number of filters in the convolution layer
  params.fc_width = 2 * k  # Width of each fully connected layer
  params.num_shared_layers = flags.board_size  # Number of shared trunk layers
  params.board_size = flags.board_size  # Board size

  # How many positions can fit on a graphics card. 256 for 9s, 16 or 32 for 19s.
  if flags.batch_size is None:
    if flags.board_size == 9:
      params.batch_size = 256
    else:
      params.batch_size = 32
  else:
    params.batch_size = flags.batch_size

  return params
def _prepare_selfplay(
    model_name, trained_models_dir, selfplay_dir, holdout_dir, sgf_dir, params):
  """Set directories and load the network for selfplay.

  Args:
    model_name: The name of the model for self-play.
    trained_models_dir: Directories where the completed generations/models are.
    selfplay_dir: Where to write the games. Set as 'base_dir/data/selfplay/'.
    holdout_dir: Where to write the holdout data. Set as
      'base_dir/data/holdout/'.
    sgf_dir: Where to write the sgf (Smart Game Format) files. Set as
      'base_dir/sgf/'.
    params: A MiniGoParams instance of hyperparameters for the model.

  Returns:
    The directories and network model for selfplay.
  """
  # Set paths for the model with 'model_name'
  model_path = os.path.join(trained_models_dir, model_name)
  output_dir = os.path.join(selfplay_dir, model_name)
  holdout_dir = os.path.join(holdout_dir, model_name)
  # clean_sgf is to write sgf file without comments.
  # full_sgf is to write sgf file with comments.
  clean_sgf = os.path.join(sgf_dir, model_name, 'clean')
  full_sgf = os.path.join(sgf_dir, model_name, 'full')

  _ensure_dir_exists(output_dir)
  _ensure_dir_exists(holdout_dir)
  _ensure_dir_exists(clean_sgf)
  _ensure_dir_exists(full_sgf)
  selfplay_dirs = {
      'output_dir': output_dir,
      'holdout_dir': holdout_dir,
      'clean_sgf': clean_sgf,
      'full_sgf': full_sgf
  }
  # cache the network model for self-play
  with utils.logged_timer('Loading weights from {} ... '.format(model_path)):
    network = dualnet.DualNetRunner(model_path, params)

  return selfplay_dirs, network


def run_selfplay(selfplay_model, selfplay_games, dirs, params):
  """Run selfplay to generate training data.

  Args:
    selfplay_model: The model name for selfplay.
    selfplay_games: The number of selfplay games.
    dirs: A MiniGoDirectory instance of directories used in each step.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  selfplay_dirs, network = _prepare_selfplay(
      selfplay_model, dirs.trained_models_dir, dirs.selfplay_dir,
      dirs.holdout_dir, dirs.sgf_dir, params)

  print('Self-play with model: {}'.format(selfplay_model))
  for _ in range(selfplay_games):
    selfplay(selfplay_dirs, network, params)
def main(_):
  """Run the reinforcement learning loop."""
  tf.logging.set_verbosity(tf.logging.INFO)

  params = _set_params(FLAGS)

  # A dummy model for debug/testing purpose with fewer games and iterations
  if FLAGS.test:
    params = model_params.DummyMiniGoParams()
    base_dir = FLAGS.base_dir + str(FLAGS.board_size) + '_size_dummy/'
  else:
    # Set directories for models and datasets
    base_dir = FLAGS.base_dir + str(FLAGS.board_size) + '_size/'

  dirs = utils.MiniGoDirectory(base_dir)

  # Run selfplay only if user specifies the argument.
  if FLAGS.selfplay:
    selfplay_model_name = FLAGS.selfplay_model_name or utils.get_latest_model(
        dirs.trained_models_dir)[1]
    max_games = FLAGS.selfplay_max_games or params.max_games_per_generation
    run_selfplay(selfplay_model_name, max_games, dirs, params)
    return

  # Run the RL pipeline
  # if no models have been trained, start from bootstrap model
  if not os.path.isdir(dirs.trained_models_dir):
    print('No trained model exists! Starting from Bootstrap...')
    print('Creating random initial weights...')
    bootstrap(dirs.estimator_model_dir, dirs.trained_models_dir, params)
...@@ -345,50 +404,51 @@ def main(_):
    print('Start from the last checkpoint...')

  _, best_model_so_far = utils.get_latest_model(dirs.trained_models_dir)
  for rl_iter in range(params.max_iters_per_pipeline):
    print('RL_iteration: {}'.format(rl_iter))

    # Self-play with the best model to generate training data
    run_selfplay(
        best_model_so_far, params.max_games_per_generation, dirs, params)

    # gather selfplay data for training
    print('Gathering game output...')
    gather(dirs.selfplay_dir, dirs.training_chunk_dir, params)

    # train the next generation model
    model_num, _ = utils.get_latest_model(dirs.trained_models_dir)
    print('Training on gathered game data...')
    train(dirs.trained_models_dir, dirs.estimator_model_dir,
          dirs.training_chunk_dir, model_num + 1, params)

    # validate the latest model if needed
    if FLAGS.validation:
      print('Validating on the holdout game data...')
      validate(dirs.trained_models_dir, dirs.holdout_dir,
               dirs.estimator_model_dir, params)

    _, current_model = utils.get_latest_model(dirs.trained_models_dir)

    if FLAGS.evaluation:  # Perform evaluation if needed
      print('Evaluate models between {} and {}'.format(
          best_model_so_far, current_model))
      black_model = os.path.join(dirs.trained_models_dir, best_model_so_far)
      white_model = os.path.join(dirs.trained_models_dir, current_model)
      _ensure_dir_exists(dirs.evaluate_dir)
      with utils.logged_timer('Loading weights'):
        black_net = dualnet.DualNetRunner(black_model, params)
        white_net = dualnet.DualNetRunner(white_model, params)

      best_model_so_far = evaluate(
          best_model_so_far, black_net, current_model, white_net,
          dirs.evaluate_dir, params)
      print('Winner of evaluation: {}!'.format(best_model_so_far))
    else:
      best_model_so_far = current_model
if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  # flags to run the RL pipeline
  parser.add_argument(
      '--base_dir',
      type=str,
...@@ -402,18 +462,45 @@ if __name__ == '__main__':
      metavar='N',
      choices=[9, 19],
      help='Go board size. The default size is 9.')
  parser.add_argument(
      '--batch_size',
      type=int,
      default=None,
      metavar='BS',
      help='Batch size for training. The default size is None')
  # Test the pipeline with a dummy model
  parser.add_argument(
      '--test',
      action='store_true',
      help='A boolean to test RL pipeline with a dummy model.')
  # Run RL pipeline with the validation step
  parser.add_argument(
      '--validation',
      action='store_true',
      help='A boolean to specify validation in the RL pipeline.')
  # Run RL pipeline with the evaluation step
  parser.add_argument(
      '--evaluation',
      action='store_true',
      help='A boolean to specify evaluation in the RL pipeline.')
  # self-play only
  parser.add_argument(
      '--selfplay',
      action='store_true',
      help='A boolean to run self-play only.')
  parser.add_argument(
      '--selfplay_model_name',
      type=str,
      default=None,
      metavar='SM',
      help='The model used for self-play only.')
  parser.add_argument(
      '--selfplay_max_games',
      type=int,
      default=None,
      metavar='SMG',
      help='The number of game data self-play only needs to generate')

  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
...@@ -18,7 +18,7 @@
class MiniGoParams(object):
  """Parameters for MiniGo."""

  # Go board size
  board_size = 9

  # RL pipeline
...@@ -51,6 +51,7 @@ class MiniGoParams(object):
  # the number of simultaneous leaves in MCTS
  simultaneous_leaves = 8

  # holdout data for validation
  holdout_pct = 0.05  # How many games to hold out for validation
  holdout_generation = 50  # How many recent generations/models for holdout data
...@@ -63,7 +64,7 @@ class MiniGoParams(object):
  # AGZ used the most recent 500k games, which, assuming 250 moves/game = 125M
  train_window_size = 125000000

  # evaluation with two models
  eval_games = 50  # The number of games to play in evaluation
  eval_readouts = 100  # How many readouts to make per move in evaluation
  eval_verbose = 1  # How verbose the players should be in evaluation
...
...@@ -205,7 +205,7 @@ class MiniGoDirectory(object):
  """The class to set up directories of MiniGo."""

  def __init__(self, base_dir):
    self.trained_models_dir = os.path.join(base_dir, 'trained_models')
    self.estimator_model_dir = os.path.join(base_dir, 'estimator_model_dir/')
    self.selfplay_dir = os.path.join(base_dir, 'data/selfplay/')
    self.holdout_dir = os.path.join(base_dir, 'data/holdout/')
...