Unverified Commit 088e24fb authored by Yanhui Liang, committed by GitHub

Add run individual step only option (#4049)

* Add run individual step only option

* Fix comments and update readme

* Add validation argument

* Address comments

* Make code shorter

* Fix more lints
parent 5be3c064
# MiniGo
This is a simplified implementation of MiniGo based on the code provided by the authors: [MiniGo](https://github.com/tensorflow/minigo).

MiniGo is a minimalist Go engine modeled after AlphaGo Zero, ["Mastering the Game of Go without Human Knowledge"](https://www.nature.com/articles/nature24270). A useful one-diagram overview of AlphaGo Zero can be found in this [cheat sheet](https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0).

This implementation provides model training and validation, as well as evaluation between two Go models. The implementation of MiniGo consists of three main components: the DualNet model, the Monte Carlo Tree Search (MCTS), and Go domain knowledge. Currently, the **DualNet model** is our focus.
## DualNet Architecture
DualNet is the neural network used in MiniGo. It is built from residual blocks with a two-headed output. The following is a brief overview of the DualNet architecture.
### Input Features
The input to the neural network is a [board_size * board_size * 17] image stack
comprising 17 binary feature planes. 8 feature planes consist of binary values
indicating the presence of the current player's stones; a further 8 feature
planes represent the corresponding features for the opponent's stones; the final
feature plane represents the color to play, and has a constant value of either 1
if black is to play or 0 if white is to play. Check [features.py](features.py) for more details.
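
The sketch below is a minimal, illustrative example (not the actual code in [features.py](features.py)) of how such a 17-plane stack could be assembled with NumPy; the `history` and `to_play` representations used here are assumptions made only for illustration.
```
import numpy as np

def make_feature_stack(history, to_play, board_size=9):
  """Builds a [board_size, board_size, 17] stack: 8 planes of the current
  player's stones, 8 planes of the opponent's stones, and 1 color plane."""
  planes = []
  for board in history[-8:]:  # 8 most recent board snapshots (+1 black, -1 white, 0 empty)
    planes.append((board == to_play).astype(np.float32))
  for board in history[-8:]:
    planes.append((board == -to_play).astype(np.float32))
  # Final plane: constant 1 if black (to_play == 1) is to move, else constant 0.
  planes.append(np.full((board_size, board_size), float(to_play == 1), np.float32))
  return np.stack(planes, axis=-1)

# Example: an empty-board history with black to move.
stack = make_feature_stack([np.zeros((9, 9), np.int8)] * 8, to_play=1)
print(stack.shape)  # (9, 9, 17)
```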
### Neural Network Structure
In the MiniGo implementation, the input features are processed by a residual tower
that consists of a single convolutional block followed by either 9 or 19
residual blocks.
...@@ -31,8 +36,9 @@ Each residual block applies the following modules sequentially to its input:
6. A skip connection that adds the input to the block
7. A rectifier non-linearity

Note: num_filter is 128 for 19 x 19 board size, and 32 for 9 x 9 board size in the MiniGo implementation.
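
To make the block structure concrete, here is a minimal TensorFlow 1.x sketch of one residual block, assuming the standard AlphaGo Zero ordering (convolution, batch normalization, ReLU, convolution, batch normalization, skip connection, ReLU). It is illustrative only and not the exact code in the MiniGo model definition; `num_filters` corresponds to the note above.
```
import tensorflow as tf

def residual_block(inputs, num_filters, training):
  """One residual block: conv -> BN -> ReLU -> conv -> BN -> skip -> ReLU."""
  conv1 = tf.layers.conv2d(inputs, num_filters, 3, padding='same', use_bias=False)
  bn1 = tf.layers.batch_normalization(conv1, training=training)
  relu1 = tf.nn.relu(bn1)
  conv2 = tf.layers.conv2d(relu1, num_filters, 3, padding='same', use_bias=False)
  bn2 = tf.layers.batch_normalization(conv2, training=training)
  # Skip connection adds the block input, followed by a final rectifier.
  return tf.nn.relu(bn2 + inputs)
```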
### Dual Heads Output
The output of the residual tower is passed into two separate "heads" for
computing the policy and value respectively. The policy head applies the
following modules:
...@@ -51,7 +57,7 @@ The value head applies the following modules:
6. A fully connected linear layer to a scalar
7. A tanh non-linearity outputting a scalar in the range [-1, 1]
In MiniGo, the overall network depth for the 10- or 20-block network is 19 or 39
parameterized layers respectively for the residual tower, plus an additional 2
layers for the policy head and 3 layers for the value head.
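
As a rough sketch of the two heads (again illustrative, following the AlphaGo Zero head layout rather than quoting the project's exact layer definitions):
```
import tensorflow as tf

def policy_and_value_heads(shared, board_size, fc_width, training):
  """Policy and value heads on top of the residual tower output `shared`."""
  # Policy head: 1x1 conv -> BN -> ReLU -> FC producing board_size^2 + 1 move logits.
  p = tf.layers.conv2d(shared, 2, 1, use_bias=False)
  p = tf.nn.relu(tf.layers.batch_normalization(p, training=training))
  policy_logits = tf.layers.dense(tf.layers.flatten(p), board_size * board_size + 1)

  # Value head: 1x1 conv -> BN -> ReLU -> FC -> ReLU -> FC to a scalar -> tanh in [-1, 1].
  v = tf.layers.conv2d(shared, 1, 1, use_bias=False)
  v = tf.nn.relu(tf.layers.batch_normalization(v, training=training))
  v = tf.nn.relu(tf.layers.dense(tf.layers.flatten(v), fc_width))
  value = tf.nn.tanh(tf.layers.dense(v, 1))
  return policy_logits, value
```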
...@@ -59,56 +65,74 @@ layers for the policy head and 3 layers for the value head.
This project assumes you have virtualenv, TensorFlow (>= 1.5) and two other Go-related
packages: pygtp (>= 0.4) and sgf (== 0.5).
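
For example, the dependencies can typically be installed inside a virtualenv with pip (assuming the standard PyPI package names):
```
pip install "tensorflow>=1.5" "pygtp>=0.4" "sgf==0.5"
```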
## Training Model
One iteration of reinforcement learning (RL) consists of the following steps:

- Bootstrap: initializes a random DualNet model. If the estimator directory already exists, the model is initialized with the last checkpoint.
- Selfplay: plays games with the latest model, or with the best model so far as identified by evaluation, producing data used for training.
- Gather: groups games played with the same model into larger files of tf_examples.
- Train: trains a new model with the selfplay results from the most recent N generations.
To run the RL pipeline, issue the following command:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256
```
Arguments:
* `--base_dir`: Base directory for MiniGo data and models. If not specified, it defaults to /tmp/minigo/.
* `--board_size`: Go board size. It can be either 9 or 19. By default, it is 9.
* `--batch_size`: Batch size for model training. If not specified, it is calculated based on the Go board size.

Use the `--help` or `-h` flag to get a full list of possible arguments. Besides these arguments, other parameters for the RL pipeline and the DualNet model can be found and configured in [model_params.py](model_params.py).
Suppose the base directory argument `base_dir` is `$HOME/minigo/` and we use 9 as the `board_size`. After model training, the following directories are created to store models and game data:

    $HOME/minigo                  # base directory
    ├── 9_size                    # directory for 9x9 board size
    │   │
    │   ├── data
    │   │   ├── holdout           # holdout data for model validation
    │   │   ├── selfplay          # data generated by selfplay of each model
    │   │   └── training_chunks   # gathered tf_examples for model training
    │   │
    │   ├── estimator_model_dir   # estimator working directory
    │   │
    │   ├── trained_models        # all the trained models
    │   │
    │   └── sgf                   # sgf (Smart Game Format) files
    │       ├── 000000-bootstrap  # model name
    │       │   ├── clean         # clean sgf files of model selfplay
    │       │   └── full          # full sgf files of model selfplay
    │       ├── ...
    │       └── evaluate          # clean sgf files of model evaluation
    └── ...
## Validating Model
To validate the trained model, issue the following command with the `--validation` argument:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256 --validation
```
## Evaluating Models
The evaluation step compares the performance of two models. Given two models, one plays black and the other plays white. They play several games (the number of games can be configured by the parameter `eval_games` in [model_params.py](model_params.py)), and the model that wins by a margin of 55% is the winner.

To include the evaluation step in the RL pipeline, specify the `--evaluation` argument to compare the performance of the `current_trained_model` and the `best_model_so_far`. The winner is used to update `best_model_so_far`. Run the following command to include the evaluation step in the pipeline:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256 --evaluation
```
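
The 55% margin can be read as a simple win-rate threshold over the evaluation games. The helper below is only a hypothetical illustration of that reading; the names and logic are assumptions, not the code in `evaluation.play_match`.
```
def pick_winner(black_name, white_name, black_wins, total_games, threshold=0.55):
  """Returns the winning model name, or None if neither side clears the margin."""
  black_rate = black_wins / float(total_games)
  if black_rate >= threshold:
    return black_name
  if 1.0 - black_rate >= threshold:
    return white_name
  return None  # no clear winner; the pipeline could keep the previous best model

print(pick_winner('000001-model', '000002-model', black_wins=30, total_games=50))
# -> 000001-model  (30/50 = 60%, which clears the 55% margin)
```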
## Testing Pipeline
As the whole RL pipeline may take hours to train even for a 9x9 board size, a `--test` argument is provided to test the pipeline quickly with a dummy neural network model.

To test the RL pipeline with a dummy model, issue the following command:
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --batch_size=256 --test
```
## Running Self-play Only
A self-play-only option is provided to run the selfplay step individually, so that training data can be generated in parallel. Issue the following command to run selfplay only with the latest trained model:
```
python minigo.py --selfplay
```
Other optional arguments:
* `--selfplay_model_name`: The name of the model used for selfplay only. If not specified, the latest trained model is used for selfplay.
* `--selfplay_max_games`: The maximum number of games selfplay is required to generate. If not specified, the default parameter `max_games_per_generation` is used.
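
For example, to generate a limited number of games from a specific model (here the bootstrap model from the directory layout above, purely as an illustration):
```
python minigo.py --base_dir=$HOME/minigo/ --board_size=9 --selfplay --selfplay_model_name=000000-bootstrap --selfplay_max_games=10
```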
...@@ -191,24 +191,24 @@ def export_model(working_dir, model_path):
    tf.gfile.Copy(filename, destination_path)


def train(working_dir, tf_records, generation, params):
  """Train the model for a specific generation.

  Args:
    working_dir: The model working directory to save model parameters,
      drop logs, checkpoints, and so on.
    tf_records: A list of tf_record filenames for training input.
    generation: The generation to be trained.
    params: hyperparams of the model.

  Raises:
    ValueError: if generation is not greater than 0.
  """
  if generation <= 0:
    raise ValueError('Model 0 is random weights')

  estimator = tf.estimator.Estimator(
      dualnet_model.model_fn, model_dir=working_dir, params=params)
  max_steps = (generation * params.examples_per_generation
               // params.batch_size)
  profiler_hook = tf.train.ProfilerHook(output_dir=working_dir, save_secs=600)
...
...@@ -49,8 +49,7 @@ class GtpInterface(object):
  def set_size(self, n):
    if n != self.board_size:
      raise ValueError((
          "Can't handle boardsize {}! Please check the board size.").format(n))

  def set_komi(self, komi):
    self.komi = komi
...@@ -75,7 +74,7 @@ class GtpInterface(object):
    self.position.flip_playerturn(mutate=True)

  def make_move(self, color, vertex):
    c = coords.from_pygtp(self.board_size, vertex)
    # let's assume this never happens for now.
    # self.accomodate_out_of_turn(color)
    return self.play_move(c)
...@@ -85,7 +84,7 @@ class GtpInterface(object):
    move = self.suggest_move(self.position)
    if self.should_resign():
      return gtp.RESIGN
    return coords.to_pygtp(self.board_size, move)

  def final_score(self):
    return self.position.result_string()
...
...@@ -66,7 +66,7 @@ def bootstrap(estimator_model_dir, trained_models_dir, params):
    estimator_model_dir: tf.estimator model directory.
    trained_models_dir: Dir to save the trained models. Here to export the first
      bootstrapped generation.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  bootstrap_name = utils.generate_model_name(0)
  _ensure_dir_exists(trained_models_dir)
...@@ -79,41 +79,23 @@ def bootstrap(estimator_model_dir, trained_models_dir, params):
  dualnet.export_model(estimator_model_dir, bootstrap_model_path)


def selfplay(selfplay_dirs, selfplay_model, params):
  """Perform selfplay with a specific model.

  Args:
    selfplay_dirs: A dict to specify the directories used in selfplay.
      selfplay_dirs = {
          'output_dir': output_dir,
          'holdout_dir': holdout_dir,
          'clean_sgf': clean_sgf,
          'full_sgf': full_sgf
      }
    selfplay_model: The actual Dualnet runner for selfplay.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  with utils.logged_timer('Playing game'):
    player = selfplay_mcts.play(
        params.board_size, selfplay_model, params.selfplay_readouts,
        params.selfplay_resign_threshold, params.simultaneous_leaves,
        params.selfplay_verbose)
...@@ -124,8 +106,8 @@ def selfplay(model_name, trained_models_dir, selfplay_dir, holdout_dir, sgf_dir,
        os.path.join(dir_sgf, '{}.sgf'.format(output_name)), 'w') as f:
      f.write(player.to_sgf(use_comments=use_comments))

  _write_sgf_data(selfplay_dirs['clean_sgf'], use_comments=False)
  _write_sgf_data(selfplay_dirs['full_sgf'], use_comments=True)

  game_data = player.extract_data()
  tf_examples = preprocessing.make_dataset_from_selfplay(game_data, params)
...@@ -133,10 +115,10 @@ def selfplay(model_name, trained_models_dir, selfplay_dir, holdout_dir, sgf_dir,
  # Hold out 5% of games for evaluation.
  if random.random() < params.holdout_pct:
    fname = os.path.join(
        selfplay_dirs['holdout_dir'], output_name + _TF_RECORD_SUFFIX)
  else:
    fname = os.path.join(
        selfplay_dirs['output_dir'], output_name + _TF_RECORD_SUFFIX)

  preprocessing.write_tf_examples(fname, tf_examples)
...@@ -148,7 +130,7 @@ def gather(selfplay_dir, training_chunk_dir, params):
    selfplay_dir: Where to look for games. Set as 'base_dir/data/selfplay/'.
    training_chunk_dir: where to put collected games. Set as
      'base_dir/data/training_chunks/'.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  # Check the selfplay data from the most recent 50 models.
  _ensure_dir_exists(training_chunk_dir)
...@@ -196,22 +178,22 @@ def gather(selfplay_dir, training_chunk_dir, params):
    f.write('\n'.join(sorted(already_processed)))


def train(trained_models_dir, estimator_model_dir, training_chunk_dir,
          generation, params):
  """Train the latest model from gathered data.

  Args:
    trained_models_dir: Where to export the completed generation.
    estimator_model_dir: tf.estimator model directory.
    training_chunk_dir: Directory where gathered training chunks are.
    generation: Which generation you are training.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  new_model_name = utils.generate_model_name(generation)
  print('New model will be {}'.format(new_model_name))
  new_model = os.path.join(trained_models_dir, new_model_name)

  tf_records = sorted(
      tf.gfile.Glob(os.path.join(training_chunk_dir, '*'+_TF_RECORD_SUFFIX)))
  tf_records = tf_records[
...@@ -219,8 +201,8 @@ def train(trained_models_dir, estimator_model_dir, training_chunk_dir, params):
  print('Training from: {} to {}'.format(tf_records[0], tf_records[-1]))
  with utils.logged_timer('Training'):
    dualnet.train(estimator_model_dir, tf_records, generation, params)
  dualnet.export_model(estimator_model_dir, new_model)


def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
...@@ -230,7 +212,7 @@ def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
    trained_models_dir: Directories where the completed generations/models are.
    holdout_dir: Directories where holdout data are.
    estimator_model_dir: tf.estimator model directory.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  model_num, _ = utils.get_latest_model(trained_models_dir)
...@@ -251,6 +233,11 @@ def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
    tf_records.extend(
        tf.gfile.Glob(os.path.join(record_dir, '*'+_TF_RECORD_SUFFIX)))

  if not tf_records:
    print('No holdout dataset for validation! '
          'Please check your holdout directory: {}'.format(holdout_dir))
    return

  print('The length of tf_records is {}.'.format(len(tf_records)))
  first_tf_record = os.path.basename(tf_records[0])
  last_tf_record = os.path.basename(tf_records[-1])
...@@ -259,21 +246,22 @@ def validate(trained_models_dir, holdout_dir, estimator_model_dir, params):
  dualnet.validate(estimator_model_dir, tf_records, params)
def evaluate(black_model_name, black_net, white_model_name, white_net,
             evaluate_dir, params):
  """Evaluate with two models.

  Two DualNetRunners play as black and white in a Go match. The two models
  play several games, and the model that wins by a margin of 55% will be the
  winner.

  Args:
    black_model_name: The name of the model playing black.
    black_net: The DualNetRunner model for black.
    white_model_name: The name of the model playing white.
    white_net: The DualNetRunner model for white.
    evaluate_dir: Where to write the evaluation results. Set as
      'base_dir/sgf/evaluate/'.
    params: A MiniGoParams instance of hyperparameters for the model.

  Returns:
    The model name of the winner.
...@@ -281,19 +269,6 @@ def evaluate(trained_models_dir, black_model_name, white_model_name,
  Raises:
    ValueError: if neither `WHITE` nor `BLACK` is returned.
  """
  with utils.logged_timer('{} games'.format(params.eval_games)):
    winner = evaluation.play_match(
        params, black_net, white_net, params.eval_games,
...@@ -305,38 +280,122 @@ def evaluate(trained_models_dir, black_model_name, white_model_name,
  return black_model_name if winner == go.BLACK_NAME else white_model_name
def _set_params(flags):
  """Set hyperparameters from the board size.

  Args:
    flags: Flags from Argparser.

  Returns:
    A MiniGoParams instance of hyperparameters.
  """
  params = model_params.MiniGoParams()
  k = utils.round_power_of_two(flags.board_size ** 2 / 3)
  params.num_filters = k  # Number of filters in the convolution layer
  params.fc_width = 2 * k  # Width of each fully connected layer
  params.num_shared_layers = flags.board_size  # Number of shared trunk layers
  params.board_size = flags.board_size  # Board size

  # How many positions can fit on a graphics card. 256 for 9s, 16 or 32 for 19s.
  if flags.batch_size is None:
    if flags.board_size == 9:
      params.batch_size = 256
    else:
      params.batch_size = 32
  else:
    params.batch_size = flags.batch_size

  return params
def _prepare_selfplay(
    model_name, trained_models_dir, selfplay_dir, holdout_dir, sgf_dir, params):
  """Set directories and load the network for selfplay.

  Args:
    model_name: The name of the model for self-play.
    trained_models_dir: Directories where the completed generations/models are.
    selfplay_dir: Where to write the games. Set as 'base_dir/data/selfplay/'.
    holdout_dir: Where to write the holdout data. Set as
      'base_dir/data/holdout/'.
    sgf_dir: Where to write the sgf (Smart Game Format) files. Set as
      'base_dir/sgf/'.
    params: A MiniGoParams instance of hyperparameters for the model.

  Returns:
    The directories and network model for selfplay.
  """
  # Set paths for the model with 'model_name'
  model_path = os.path.join(trained_models_dir, model_name)
  output_dir = os.path.join(selfplay_dir, model_name)
  holdout_dir = os.path.join(holdout_dir, model_name)
  # clean_sgf is to write sgf file without comments.
  # full_sgf is to write sgf file with comments.
  clean_sgf = os.path.join(sgf_dir, model_name, 'clean')
  full_sgf = os.path.join(sgf_dir, model_name, 'full')

  _ensure_dir_exists(output_dir)
  _ensure_dir_exists(holdout_dir)
  _ensure_dir_exists(clean_sgf)
  _ensure_dir_exists(full_sgf)
  selfplay_dirs = {
      'output_dir': output_dir,
      'holdout_dir': holdout_dir,
      'clean_sgf': clean_sgf,
      'full_sgf': full_sgf
  }
  # cache the network model for self-play
  with utils.logged_timer('Loading weights from {} ... '.format(model_path)):
    network = dualnet.DualNetRunner(model_path, params)

  return selfplay_dirs, network


def run_selfplay(selfplay_model, selfplay_games, dirs, params):
  """Run selfplay to generate training data.

  Args:
    selfplay_model: The model name for selfplay.
    selfplay_games: The number of selfplay games.
    dirs: A MiniGoDirectory instance of directories used in each step.
    params: A MiniGoParams instance of hyperparameters for the model.
  """
  selfplay_dirs, network = _prepare_selfplay(
      selfplay_model, dirs.trained_models_dir, dirs.selfplay_dir,
      dirs.holdout_dir, dirs.sgf_dir, params)

  print('Self-play with model: {}'.format(selfplay_model))
  for _ in range(selfplay_games):
    selfplay(selfplay_dirs, network, params)
def main(_):
  """Run the reinforcement learning loop."""
  tf.logging.set_verbosity(tf.logging.INFO)

  params = _set_params(FLAGS)

  # A dummy model for debug/testing purpose with fewer games and iterations
  if FLAGS.test:
    params = model_params.DummyMiniGoParams()
    base_dir = FLAGS.base_dir + str(FLAGS.board_size) + '_size_dummy/'
  else:
    # Set directories for models and datasets
    base_dir = FLAGS.base_dir + str(FLAGS.board_size) + '_size/'

  dirs = utils.MiniGoDirectory(base_dir)

  # Run selfplay only if user specifies the argument.
  if FLAGS.selfplay:
    selfplay_model_name = FLAGS.selfplay_model_name or utils.get_latest_model(
        dirs.trained_models_dir)[1]
    max_games = FLAGS.selfplay_max_games or params.max_games_per_generation
    run_selfplay(selfplay_model_name, max_games, dirs, params)
    return

  # Run the RL pipeline
  # if no models have been trained, start from bootstrap model
  if not os.path.isdir(dirs.trained_models_dir):
    print('No trained model exists! Starting from Bootstrap...')
    print('Creating random initial weights...')
    bootstrap(dirs.estimator_model_dir, dirs.trained_models_dir, params)
...@@ -345,50 +404,51 @@ def main(_):
    print('Start from the last checkpoint...')

  _, best_model_so_far = utils.get_latest_model(dirs.trained_models_dir)
  for rl_iter in range(params.max_iters_per_pipeline):
    print('RL_iteration: {}'.format(rl_iter))

    # Self-play with the best model to generate training data
    run_selfplay(
        best_model_so_far, params.max_games_per_generation, dirs, params)

    # gather selfplay data for training
    print('Gathering game output...')
    gather(dirs.selfplay_dir, dirs.training_chunk_dir, params)

    # train the next generation model
    model_num, _ = utils.get_latest_model(dirs.trained_models_dir)
    print('Training on gathered game data...')
    train(dirs.trained_models_dir, dirs.estimator_model_dir,
          dirs.training_chunk_dir, model_num + 1, params)

    # validate the latest model if needed
    if FLAGS.validation:
      print('Validating on the holdout game data...')
      validate(dirs.trained_models_dir, dirs.holdout_dir,
               dirs.estimator_model_dir, params)

    _, current_model = utils.get_latest_model(dirs.trained_models_dir)

    if FLAGS.evaluation:  # Perform evaluation if needed
      print('Evaluate models between {} and {}'.format(
          best_model_so_far, current_model))
      black_model = os.path.join(dirs.trained_models_dir, best_model_so_far)
      white_model = os.path.join(dirs.trained_models_dir, current_model)
      _ensure_dir_exists(dirs.evaluate_dir)
      with utils.logged_timer('Loading weights'):
        black_net = dualnet.DualNetRunner(black_model, params)
        white_net = dualnet.DualNetRunner(white_model, params)

      best_model_so_far = evaluate(
          best_model_so_far, black_net, current_model, white_net,
          dirs.evaluate_dir, params)
      print('Winner of evaluation: {}!'.format(best_model_so_far))
    else:
      best_model_so_far = current_model
if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  # flags to run the RL pipeline
  parser.add_argument(
      '--base_dir',
      type=str,
...@@ -402,18 +462,45 @@ if __name__ == '__main__':
      metavar='N',
      choices=[9, 19],
      help='Go board size. The default size is 9.')
  parser.add_argument(
      '--batch_size',
      type=int,
      default=None,
      metavar='BS',
      help='Batch size for training. The default size is None')
  # Test the pipeline with a dummy model
  parser.add_argument(
      '--test',
      action='store_true',
      help='A boolean to test RL pipeline with a dummy model.')
  # Run RL pipeline with the validation step
  parser.add_argument(
      '--validation',
      action='store_true',
      help='A boolean to specify validation in the RL pipeline.')
  # Run RL pipeline with the evaluation step
  parser.add_argument(
      '--evaluation',
      action='store_true',
      help='A boolean to specify evaluation in the RL pipeline.')
  # self-play only
  parser.add_argument(
      '--selfplay',
      action='store_true',
      help='A boolean to run self-play only.')
  parser.add_argument(
      '--selfplay_model_name',
      type=str,
      default=None,
      metavar='SM',
      help='The model used for self-play only.')
  parser.add_argument(
      '--selfplay_max_games',
      type=int,
      default=None,
      metavar='SMG',
      help='The number of game data self-play only needs to generate')

  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
...@@ -18,7 +18,7 @@
class MiniGoParams(object):
  """Parameters for MiniGo."""

  # Go board size
  board_size = 9

  # RL pipeline
...@@ -51,6 +51,7 @@ class MiniGoParams(object):
  # the number of simultaneous leaves in MCTS
  simultaneous_leaves = 8

  # holdout data for validation
  holdout_pct = 0.05  # How many games to hold out for validation
  holdout_generation = 50  # How many recent generations/models for holdout data
...@@ -63,7 +64,7 @@ class MiniGoParams(object):
  # AGZ used the most recent 500k games, which, assuming 250 moves/game = 125M
  train_window_size = 125000000

  # evaluation with two models
  eval_games = 50  # The number of games to play in evaluation
  eval_readouts = 100  # How many readouts to make per move in evaluation
  eval_verbose = 1  # How verbose the players should be in evaluation
...
...@@ -205,7 +205,7 @@ class MiniGoDirectory(object):
  """The class to set up directories of MiniGo."""

  def __init__(self, base_dir):
    self.trained_models_dir = os.path.join(base_dir, 'trained_models')
    self.estimator_model_dir = os.path.join(base_dir, 'estimator_model_dir/')
    self.selfplay_dir = os.path.join(base_dir, 'data/selfplay/')
    self.holdout_dir = os.path.join(base_dir, 'data/holdout/')
...