- 05 Sep, 2018 2 commits
-
-
Reed authored
* Fix spurious "did not start correctly" error. The error "Generation subprocess did not start correctly" would occur if the async process started up after the main process checked for the subproc_alive file. * Add error message
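A minimal sketch of the kind of startup check involved, assuming a hypothetical helper name, marker-file path, and timeout (the actual change may differ): poll for the subproc_alive file rather than checking it only once.

```python
import os
import time


def wait_for_subprocess(alive_file, timeout_sec=300, poll_sec=1.0):
    """Poll for the subproc_alive marker instead of checking it only once.

    A single check can race with an async subprocess that is still starting
    up and report "did not start correctly" spuriously.
    """
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if os.path.exists(alive_file):
            return
        time.sleep(poll_sec)
    raise RuntimeError(
        "Generation subprocess did not start correctly: no {} after {} seconds"
        .format(alive_file, timeout_sec))
```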
-
Reed authored
When constructing the evaluation records, data_async_generation.py would copy the records into the final directory. The main process would wait until the eval records existed. However, the main process would sometimes read the eval records before they were fully copied, causing a DataLossError.
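One common way to keep a waiting reader from seeing a half-copied file is to copy under a temporary name and rename atomically; this is a hedged sketch (the helper name is illustrative), not necessarily the fix that was applied.

```python
import os
import shutil
import tempfile


def copy_atomically(src, dst):
    """Copy src next to dst under a temporary name, then rename into place.

    os.rename is atomic within a filesystem, so a process polling for dst
    never observes a partially copied set of eval records.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dst))
    os.close(fd)
    shutil.copyfile(src, tmp_path)
    os.rename(tmp_path, dst)
```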
-
- 04 Sep, 2018 1 commit
-
-
Yanhui Liang authored
-
- 02 Sep, 2018 2 commits
- 01 Sep, 2018 2 commits
- 30 Aug, 2018 1 commit
-
-
Aman Gupta authored
Bypass the Export model step if training on TPUs, as it needs inference to be supported on TPUs. Remove this check once inference is supported. (#5209)
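A hedged sketch of that kind of guard, with illustrative flag and function names (the estimator and serving_input_receiver_fn are assumed to exist):

```python
import tensorflow as tf


def maybe_export_model(estimator, export_dir, serving_input_receiver_fn, use_tpu):
    """Export a SavedModel unless we are training on TPUs."""
    if use_tpu:
        # Inference is not yet supported on TPUs, so skip the export step.
        # Remove this guard once TPU inference is supported.
        tf.logging.info("Skipping SavedModel export while training on TPUs.")
        return None
    return estimator.export_savedmodel(export_dir, serving_input_receiver_fn)
```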
-
- 29 Aug, 2018 1 commit
-
-
Yanhui Liang authored
* Add distribution strategy to keras benchmark * Fix comments * Fix lints
-
- 28 Aug, 2018 2 commits
-
-
Jaeman authored
* Fix bug on distributed training in mnist using the MirroredStrategy API * Remove unnecessary code and change the distribution strategy source - Remove multi-gpu - Remove TowerOptimizer - Change from MirroredStrategy to distribution_utils.get_distribution_strategy
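A sketch of the pattern the message describes, assuming the repo's official/utils/misc/distribution_utils helper, a TF 1.x Estimator, and an existing model_fn; exact signatures may differ slightly.

```python
import tensorflow as tf
from official.utils.misc import distribution_utils


def make_estimator(model_fn, model_dir, num_gpus):
    # Let the shared helper pick the strategy (none, one device, or
    # MirroredStrategy) instead of wiring up multi-GPU replication and
    # TowerOptimizer by hand.
    strategy = distribution_utils.get_distribution_strategy(num_gpus)
    run_config = tf.estimator.RunConfig(train_distribute=strategy)
    return tf.estimator.Estimator(
        model_fn=model_fn, model_dir=model_dir, config=run_config)
```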
-
Josh Gordon authored
-
- 27 Aug, 2018 2 commits
-
-
Taylor Robie authored
* Make ResNet robust to the case that epochs_between_evals does not divide train_epochs, and add an --eval_only option * add some comments to make the control flow easier to follow * address PR comments
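The scheduling logic being described can be sketched roughly as follows (names are illustrative, not the actual resnet_run_loop code):

```python
import math


def epochs_per_cycle(train_epochs, epochs_between_evals, eval_only=False):
    """Return how many training epochs to run before each evaluation."""
    if eval_only or train_epochs == 0:
        # A single cycle of zero training epochs: just evaluate.
        return [0]
    num_cycles = int(math.ceil(train_epochs / float(epochs_between_evals)))
    schedule = [epochs_between_evals] * num_cycles
    # If epochs_between_evals does not divide train_epochs, shrink the last
    # cycle instead of overshooting the requested number of epochs.
    schedule[-1] = train_epochs - epochs_between_evals * (num_cycles - 1)
    return schedule


# e.g. epochs_per_cycle(10, 4) -> [4, 4, 2]
```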
-
Toby Boyd authored
* Add 5 epoch warmup * get_lr with warm_up only for imagenet * Add base_lr, remove fp16 unittest arg validation * Remove validation check stopping v1 and FP16
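A rough sketch of a linear five-epoch learning-rate warmup driven by a base_lr, using illustrative names rather than the actual get_lr implementation:

```python
import tensorflow as tf


def learning_rate_with_warmup(global_step, base_lr, batches_per_epoch,
                              warmup_epochs=5):
    """Scale the learning rate linearly from 0 to base_lr over warmup_epochs."""
    warmup_steps = float(warmup_epochs * batches_per_epoch)
    step = tf.cast(global_step, tf.float32)
    warmup_lr = base_lr * step / warmup_steps
    return tf.cond(step < warmup_steps,
                   lambda: warmup_lr,
                   lambda: tf.constant(base_lr, tf.float32))
```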
-
- 25 Aug, 2018 1 commit
-
-
Toby Boyd authored
* Add top_5 to eval. * Change labels shape from [?,1] to [?] so it matches the unittest.
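A hedged sketch of what adding a top-5 metric and squeezing the labels can look like in an Estimator-style model_fn (names are illustrative):

```python
import tensorflow as tf


def eval_metric_ops(labels, logits):
    # Reshape labels from [batch, 1] to [batch] so they line up with the
    # metric ops (and with the shape the unit tests expect).
    labels = tf.reshape(labels, [-1])
    in_top_5 = tf.nn.in_top_k(predictions=logits, targets=labels, k=5)
    return {
        "accuracy": tf.metrics.accuracy(labels, tf.argmax(logits, axis=1)),
        "accuracy_top_5": tf.metrics.mean(tf.cast(in_top_5, tf.float32)),
    }
```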
-
- 22 Aug, 2018 1 commit
-
-
Reed authored
* Fix convergence issues for MLPerf. Thank you to @robieta for helping me find these issues, and for providing an algorithm for the `get_hit_rate_and_ndcg_mlperf` function. This change causes every forked process to set a new seed, so that forked processes do not generate the same set of random numbers. This improves evaluation hit rates. Additionally, it adds a flag, --ml_perf, that makes further changes so that the evaluation hit rate can match the MLPerf reference implementation. I ran 4 times with --ml_perf and 4 times without. Without --ml_perf, the highest hit rates achieved by each run were 0.6278, 0.6287, 0.6289, and 0.6241. With --ml_perf, the highest hit rates were 0.6353, 0.6356, 0.6367, and 0.6353. * fix lint error * Fix failing test * Address @robieta's feedback * Address more feedback
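A minimal sketch of re-seeding in forked workers so they do not all reproduce the parent's random stream; the pool setup here is illustrative.

```python
import multiprocessing

import numpy as np


def _reseed_worker():
    # Forked workers inherit the parent's NumPy RNG state, so without this
    # every worker would draw the same "random" negatives. Calling seed()
    # with no argument re-seeds each process from OS entropy.
    np.random.seed()


def make_pool(num_workers):
    return multiprocessing.Pool(processes=num_workers, initializer=_reseed_worker)
```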
-
- 20 Aug, 2018 1 commit
-
-
Taylor Robie authored
* perform a codecs check and remove unicode \ufeff if utf-8 is not present * delint
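A small sketch of the kind of BOM handling described, with an illustrative helper; the actual codecs check may differ.

```python
import codecs


def read_text(path):
    with open(path, "rb") as f:
        raw = f.read()
    # Some editors prepend a UTF-8 byte-order mark, which decodes to the
    # unicode character \ufeff; strip it so downstream processing never sees it.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")
```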
-
- 18 Aug, 2018 1 commit
-
-
Reed authored
This is done by using a higher Pickle protocol version, which the Python docs describe as being "slightly more efficient". This reduces the file write time at the beginning from 2 1/2 minutes to 5 seconds.
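The change described amounts to passing a newer protocol to pickle; a hedged sketch with an illustrative helper:

```python
import pickle


def dump_fast(obj, path):
    with open(path, "wb") as f:
        # Protocol 0 (the Python 2 default) is text based and slow to write;
        # the highest available binary protocol is far faster for large objects.
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
```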
-
- 16 Aug, 2018 2 commits
-
-
Jules Gagnon-Marchand authored
* Deterministic dataset order fix. In order for the order of the files to be deterministic, `shuffle` in `tf.data.Dataset.list_files(..., shuffle)` needs to be False (or a fixed seed supplied); otherwise different iterator inits will yield different file orders * Removed unnecessary shuffle of filenames * Removed the `_FILE_SHUFFLE_BUFFER` definition
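A sketch of the tf.data pattern involved (the interleave stage is illustrative): list_files shuffles non-deterministically by default, so its shuffle argument has to be set explicitly when a deterministic file order is needed.

```python
import tensorflow as tf


def make_dataset(file_pattern, shuffle):
    # With shuffle=False the filenames come back in a deterministic order;
    # the default behavior returns them in a non-deterministic shuffled order.
    dataset = tf.data.Dataset.list_files(file_pattern, shuffle=shuffle)
    dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
    return dataset
```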
-
Taylor Robie authored
-
- 15 Aug, 2018 1 commit
-
-
Wei Wang authored
-
- 14 Aug, 2018 2 commits
-
-
alope107 authored
* Fix Transformer TPU crash in Python 2.X. - Tensorflow raises an error when tf_inspect.getfullargspec is called on a functools.partial in Python 2.X. This issue would be hit during the eval stage of the Transformer TPU model. This change replaces the call to functools.partial with a lambda to work around the issue. * Remove unused import from transformer_main. * Fix lint error.
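A sketch of the workaround described, using a stand-in model_fn: hand TensorFlow a lambda, which Python 2 can introspect, instead of a functools.partial object.

```python
import functools


def model_fn(features, labels, mode, params):
    """Stand-in for the Estimator model_fn used on the Transformer TPU path."""
    return None


# Before: a functools.partial object, which inspect.getargspec (the Python 2
# fallback used by tf_inspect.getfullargspec) cannot handle.
wrapped_partial = functools.partial(model_fn, params={"hidden_size": 512})

# After: a plain lambda exposes a real signature, so the eval stage no longer
# crashes under Python 2.
wrapped_lambda = lambda features, labels, mode: model_fn(
    features, labels, mode, params={"hidden_size": 512})
```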
-
Zac Wellmer authored
* Warm start a ResNet with all but the dense layer, and only update the final layer weights when fine tuning * Update README for Transfer Learning * Make lint happy and fix a variable naming error related to scaled gradients * Edit the test cases for cifar10 and imagenet to reflect the default case of no fine tuning
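A hedged sketch of warm-starting everything except the final dense layer with a TF 1.x Estimator; the checkpoint path, model_dir, regex, and stand-in model_fn are illustrative assumptions.

```python
import tensorflow as tf


def model_fn(features, labels, mode, params):
    """Stand-in for the ResNet model_fn."""
    return None


# Initialize every variable from a pretrained checkpoint except those whose
# names contain "dense", so only the final classification layer starts from
# scratch (and, when fine tuning, only its weights are updated).
warm_start = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/tmp/resnet_imagenet/model.ckpt",
    vars_to_warm_start="^(?!.*dense)")

classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="/tmp/resnet_finetune",
    warm_start_from=warm_start)
```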
-
- 13 Aug, 2018 1 commit
-
-
kangtop729 authored
Fix a typo.
-
- 10 Aug, 2018 1 commit
-
-
Yanhui Liang authored
-
- 02 Aug, 2018 2 commits
-
-
Reed authored
-
Reed authored
The data_async_generation.py process would print to stderr, but the main process would redirect its stderr to a pipe. The main process never read from the pipe, so when the pipe was full, data_async_generation.py would stall on a write to stderr. This change makes data_async_generation.py not write to stdout/stderr.
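A small sketch of the failure mode and one way around it (the exact change in data_async_generation.py may differ):

```python
import subprocess
import sys

# Deadlock-prone: the child writes to stderr, but nothing drains the pipe,
# so the child blocks once the pipe buffer fills up.
# proc = subprocess.Popen([sys.executable, "data_async_generation.py"],
#                         stderr=subprocess.PIPE)

# Safer: send the child's output somewhere that never fills up, or have the
# child log to a file instead of writing to stdout/stderr at all.
proc = subprocess.Popen([sys.executable, "data_async_generation.py"],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL)
```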
-
- 01 Aug, 2018 1 commit
-
-
Reed authored
The output of an embedding layer is already flattened, so the Flatten layers acted as no-ops.
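A small illustration of the reasoning, assuming the ids are fed as one id per example (a rank-1 input); the layer sizes are made up.

```python
import tensorflow as tf

# With one id per example, the embedding output is already
# [batch, embedding_dim], so a Flatten layer after it changes nothing.
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)
ids = tf.constant([3, 17, 42])                   # shape [3]
vectors = embedding(ids)                         # shape [3, 8]
flattened = tf.keras.layers.Flatten()(vectors)   # still [3, 8]: a no-op
```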
-
- 31 Jul, 2018 8 commits
-
-
Taylor Robie authored
-
Reed authored
* Fix crash when Python interpreter not on PATH. * Fix lint error.
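One likely shape for this kind of fix, though the actual change may differ: launch the subprocess through sys.executable instead of assuming a python binary is on PATH (the script name here is hypothetical).

```python
import subprocess
import sys

# sys.executable is the absolute path of the interpreter running this script,
# so the launch works even when "python" does not resolve on PATH.
proc = subprocess.Popen([sys.executable, "async_worker.py"])
```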
-
Reed authored
-
Reed authored
-
Taylor Robie authored
* add indirection file * remove unused imports * fix import
-
Reed authored
-
Reed authored
-
Reed authored
-
- 30 Jul, 2018 2 commits
-
-
Taylor Robie authored
* intermediate commit
* ncf now working
* reorder pipeline
* allow batched decode for file backed dataset
* fix bug
* more tweaks
* parallelize false negative generation
* shared pool hack
* workers ignore sigint
* intermediate commit
* simplify buffer backed dataset creation to fixed length record approach only (more cleanup needed)
* more tweaks
* simplify pipeline
* fix misplaced cleanup() calls (validation works!)
* more tweaks
* sixify memoryview usage
* more sixification
* fix bug
* add future imports
* break up training input pipeline
* more pipeline tuning
* first pass at moving negative generation to async
* refactor async pipeline to use files instead of ipc
* refactor async pipeline
* move expansion and concatenation from reduce worker to generation workers
* abandon complete async due to interactions with the tensorflow threadpool
* cleanup
* remove performance_comparison.py
* experiment with rough generator + interleave pipeline
* yet more pipeline tuning
* update on-the-fly pipeline
* refactor preprocessing, and move train generation behind a GRPC server
* fix leftover call
* intermediate commit
* intermediate commit
* fix index error in data pipeline, and add logging to train data server
* make sharding more robust to imbalance
* correctly sample with replacement
* file buffers are no longer needed for this branch
* tweak sampling methods
* add README for data pipeline
* fix eval sampling, and vectorize eval metrics
* add spillover and static training batch sizes
* clean up cruft from earlier iterations
* rough delint
* delint 2 / n
* add type annotations
* update run script
* make run.sh a bit nicer
* change embedding initializer to match reference
* rough pass at pure estimator model_fn
* impose static shape hack (revisit later)
* refinements
* fix dir error in run.sh
* add documentation
* add more docs and fix an assert
* old data test is no longer valid. Keeping it around as reference for the new one
* rough draft of data pipeline validation script
* don't rely on shuffle default
* tweaks and documentation
* add separate eval batch size for performance
* initial commit
* terrible hacking
* mini hacks
* missed a bug
* messing about trying to get TPU running
* TFRecords based TPU attempt
* bug fixes
* don't log remotely
* more bug fixes
* TPU tweaks and bug fixes
* more tweaks
* more adjustments
* rework model definition
* tweak data pipeline
* refactor async TFRecords generation
* temp commit to run.sh
* update log behavior
* fix logging bug
* add check for subprocess start to avoid cryptic hangs
* unify deserialize and make it TPU compliant
* delint
* remove gRPC pipeline code
* fix logging bug
* delint and remove old test files
* add unit tests for NCF pipeline
* delint
* clean up run.sh, and add run_tpu.sh
* forgot the most important line
* fix run.sh bugs
* yet more bash debugging
* small tweak to add keras summaries to model_fn
* Clean up sixification issues
* address PR comments
* delinting is never over
-
Sundara Tejaswi Digumarti authored
Removed the conditional over distributed strategies when computing metrics. Metrics are now computed even when distributed strategies are used.
-
- 26 Jul, 2018 1 commit
-
-
Jiang Yu authored
* Fix batch_size in transformer_main.py, which causes ResourceExhaustedError (OOM) when training Transformer models using models/official/transformer * Small format change: split one line into multiple lines in order to pass lint tests * Remove trailing space and add comment
-
- 21 Jul, 2018 1 commit
-
-
Igor Ganichev authored
float32 should be fine for mnist loss and accuracy metrics and float64 is not available on TPUs.
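A tiny sketch of keeping the metric computation in float32, with illustrative names; the actual change in the mnist example may differ.

```python
import tensorflow as tf


def metric_fn(labels, logits):
    # float64 is not available on TPUs, and float32 is plenty of precision
    # for the mnist loss and accuracy metrics.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    loss = tf.cast(loss, tf.float32)
    accuracy = tf.metrics.accuracy(
        labels=labels, predictions=tf.argmax(logits, axis=1))
    return loss, {"accuracy": accuracy}
```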
-
- 20 Jul, 2018 1 commit
-
-
Yanhui Liang authored
* Add more arguments * Add eager mode * Add notes for eager mode * Address the comments * Fix argument typos * Add warning for eager and multi-gpu * Fix typo * Fix notes * Fix pylint
-