1. 05 Sep, 2018 2 commits
    • Fix spurious "did not start correctly" error. (#5252) · 7babedc5
      Reed authored
      * Fix spurious "did not start correctly" error.
      
      The error "Generation subprocess did not start correctly" would occur if the async process started up after the main process checked for the subproc_alive file.
      
      * Add error message
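
      A minimal sketch of one way to avoid this kind of race: poll for the sentinel with a deadline instead of checking once. The `subproc_alive` file name comes from the commit message; the helper name and timeout values are illustrative:

      ```python
      import os
      import time


      def wait_for_subprocess(alive_file, timeout_sec=300, poll_interval_sec=1):
        """Poll for the async generator's sentinel file instead of checking once.

        A single check races with the subprocess, which may simply not have
        written the file yet. Polling with a deadline distinguishes a slow
        start from a genuine failure.
        """
        deadline = time.time() + timeout_sec
        while time.time() < deadline:
          if os.path.exists(alive_file):
            return  # Subprocess started correctly.
          time.sleep(poll_interval_sec)
        raise RuntimeError(
            "Generation subprocess did not start correctly: %s never appeared."
            % alive_file)
      ```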
    • Fix crash caused by race in the async process. (#5250) · 5856878d
      Reed authored
      When constructing the evaluation records, data_async_generation.py would copy the records into the final directory. The main process would wait until the eval records existed. However, the main process would sometimes read the eval records before they were fully copied, causing a DataLossError.
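
      A common fix for this read-before-copy-completes race is to write to a temporary name in the destination directory and then atomically rename into place, so a reader never observes a half-written file. A minimal sketch assuming a POSIX filesystem; the helper name and paths are illustrative, not the repository's actual code:

      ```python
      import os
      import shutil
      import tempfile


      def atomic_copy(src, dst):
        """Copy src to dst so dst never exists in a partially written state."""
        dst_dir = os.path.dirname(dst) or "."
        fd, tmp_path = tempfile.mkstemp(dir=dst_dir)
        os.close(fd)
        shutil.copyfile(src, tmp_path)
        # os.rename is atomic on POSIX when source and target are on the same
        # filesystem, so a waiting reader sees either no file or the whole file.
        os.rename(tmp_path, dst)
      ```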
  2. 04 Sep, 2018 1 commit
  3. 02 Sep, 2018 2 commits
  4. 01 Sep, 2018 2 commits
  5. 30 Aug, 2018 1 commit
  6. 29 Aug, 2018 1 commit
  7. 28 Aug, 2018 2 commits
  8. 27 Aug, 2018 2 commits
    • ResNet eval_only mode (#5186) · d1c48afc
      Taylor Robie authored
      * Make ResNet robust to the case where epochs_between_evals does not divide train_epochs, and add an --eval_only option (see the sketch after this commit)
      
      * add some comments to make the control flow easier to follow
      
      * address PR comments
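
      A sketch of the scheduling logic such a change implies: split train_epochs into evaluation cycles, letting the final cycle be shorter when epochs_between_evals does not divide evenly. The function name is illustrative:

      ```python
      import math


      def build_epoch_schedule(train_epochs, epochs_between_evals,
                               eval_only=False):
        """Return how many training epochs to run before each evaluation."""
        if eval_only:
          return [0]  # Skip training entirely and run a single evaluation.
        n_cycles = int(math.ceil(train_epochs / float(epochs_between_evals)))
        schedule = [epochs_between_evals] * n_cycles
        # Shrink the final cycle so the total matches train_epochs exactly.
        schedule[-1] = train_epochs - epochs_between_evals * (n_cycles - 1)
        return schedule


      # e.g. train_epochs=10, epochs_between_evals=4 -> [4, 4, 2]
      ```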
    • Add 5 epoch warmup to resnet (#5176) · 9bf586de
      Toby Boyd authored
      * Add a 5-epoch learning-rate warmup (see the sketch after this commit)
      
      * get_lr with warm_up only for imagenet
      
      * Add base_lr, remove fp16 unittest arg validation
      
      * Remove validation check stopping v1 and FP16
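
      A sketch of a linear warmup ramp of the kind this commit describes: scale from zero up to base_lr over the first five epochs, then hand off to the usual step-decay schedule. This is a plain-Python illustration, not the repository's TensorFlow graph code:

      ```python
      def learning_rate_with_warmup(global_step, steps_per_epoch, base_lr,
                                    warmup_epochs=5):
        """Ramp linearly from 0 to base_lr over the first warmup_epochs."""
        warmup_steps = warmup_epochs * steps_per_epoch
        if global_step < warmup_steps:
          return base_lr * float(global_step) / warmup_steps
        return base_lr  # After warmup, the normal decay schedule takes over.
      ```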
  9. 25 Aug, 2018 1 commit
  10. 22 Aug, 2018 1 commit
    • Fix convergence issues for MLPerf. (#5161) · 64710c05
      Reed authored
      * Fix convergence issues for MLPerf.
      
      Thank you to @robieta for helping me find these issues, and for providing an algorithm for the `get_hit_rate_and_ndcg_mlperf` function.
      
      This change causes every forked process to set a new seed, so that forked processes do not generate the same set of random numbers. This improves evaluation hit rates (see the sketch after this commit).
      
      Additionally, it adds a flag, --ml_perf, that makes further changes so that the evaluation hit rate can match the MLPerf reference implementation.
      
      I ran 4 times with --ml_perf and 4 times without. Without --ml_perf, the highest hit rates achieved by each run were 0.6278, 0.6287, 0.6289, and 0.6241. With --ml_perf, the highest hit rates were 0.6353, 0.6356, 0.6367, and 0.6353.
      
      * fix lint error
      
      * Fix failing test
      
      * Address @robieta's feedback
      
      * Address more feedback
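
      A sketch of the per-worker reseeding described above: give each forked process a fresh seed in a multiprocessing pool initializer so the workers draw different random streams. The initializer is illustrative, not the actual NCF code:

      ```python
      import multiprocessing
      import os
      import struct

      import numpy as np


      def _reseed_worker():
        # Forked workers inherit the parent's NumPy RNG state; without
        # reseeding, every worker would generate the same "random" negatives.
        np.random.seed(struct.unpack("I", os.urandom(4))[0])


      if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=4, initializer=_reseed_worker)
      ```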
  11. 20 Aug, 2018 1 commit
  12. 18 Aug, 2018 1 commit
    • Speed up cache construction. (#5131) · 5aee67b4
      Reed authored
      This is done by using a higher Pickle protocol version, which the Python docs describe as being "slightly more efficient". This reduces the file write time at the beginning from 2 1/2 minutes to 5 seconds.
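
      The change amounts to passing a newer protocol to pickle.dump; a minimal illustration with placeholder data:

      ```python
      import pickle

      data = {"user_ids": list(range(1000000))}

      with open("cache.pkl", "wb") as f:
        # Python 2 defaults to the slow, human-readable protocol 0;
        # HIGHEST_PROTOCOL selects a compact binary encoding instead.
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
      ```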
  13. 16 Aug, 2018 2 commits
  14. 15 Aug, 2018 1 commit
  15. 14 Aug, 2018 2 commits
    • Transformer partial fix (#5092) · 6f5967a0
      alope107 authored
      * Fix Transformer TPU crash in Python 2.X.
      
      - Tensorflow raises an error when tf_inspect.getfullargspec is called on a functools.partial in Python 2.X. This issue would be hit during the eval stage of the Transformer TPU model. This change replaces the call to functools.partial with a lambda to work around the issue (see the sketch after this commit).
      
      * Remove unused import from transformer_main.
      
      * Fix lint error.
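
      A sketch of the workaround, with illustrative names rather than the actual Transformer code:

      ```python
      import functools


      def model_fn(features, labels, mode, params):
        """Stand-in for the Estimator model_fn."""
        del features, labels, mode, params


      params = {"hidden_size": 512}

      # Python 2's inspect module cannot take the argspec of a
      # functools.partial, which tf_inspect.getfullargspec relies on,
      # so this form crashes during eval:
      partial_fn = functools.partial(model_fn, params=params)

      # A lambda is an ordinary function with a real argspec, so it
      # inspects cleanly:
      lambda_fn = lambda features, labels, mode: model_fn(
          features, labels, mode, params=params)
      ```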
    • Resnet transfer learning (#5047) · 7bffd37b
      Zac Wellmer authored
      * Warm start a ResNet with all but the dense layer, and update only the final layer's weights when fine-tuning (see the sketch after this commit)
      
      * Update README for Transfer Learning
      
      * Make lint happy and fix a variable naming error related to scaled gradients
      
      * edit the test cases for cifar10 and imagenet to reflect the default case of no fine tuning
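
      A sketch of how such a warm-start-and-fine-tune split is typically wired up with tf.estimator in TF 1.x; the checkpoint path is a placeholder and the "dense" name filter is an assumption about how the final layer is named:

      ```python
      import tensorflow as tf

      # Warm-start every variable except those of the final dense layer.
      warm_start = tf.estimator.WarmStartSettings(
          ckpt_to_initialize_from="/path/to/pretrained_resnet",
          vars_to_warm_start="^(?!.*dense)")


      def _dense_grad_filter(grads_and_vars):
        """When fine-tuning, keep only the final layer's gradients."""
        return [(g, v) for g, v in grads_and_vars if "dense" in v.name]
      ```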
  16. 13 Aug, 2018 1 commit
  17. 10 Aug, 2018 1 commit
  18. 02 Aug, 2018 2 commits
  19. 01 Aug, 2018 1 commit
  20. 31 Jul, 2018 8 commits
  21. 30 Jul, 2018 2 commits
    • NCF pipeline refactor (take 2) and initial TPU port. (#4935) · 6518c1c7
      Taylor Robie authored
      * intermediate commit
      
      * ncf now working
      
      * reorder pipeline
      
      * allow batched decode for file backed dataset
      
      * fix bug
      
      * more tweaks
      
      * parallelize false negative generation
      
      * shared pool hack
      
      * workers ignore SIGINT (see the sketch after this commit)
      
      * intermediate commit
      
      * simplify buffer backed dataset creation to fixed length record approach only. (more cleanup needed)
      
      * more tweaks
      
      * simplify pipeline
      
      * fix misplaced cleanup() calls. (validation works!)
      
      * more tweaks
      
      * sixify memoryview usage
      
      * more sixification
      
      * fix bug
      
      * add future imports
      
      * break up training input pipeline
      
      * more pipeline tuning
      
      * first pass at moving negative generation to async
      
      * refactor async pipeline to use files instead of ipc
      
      * refactor async pipeline
      
      * move expansion and concatenation from reduce worker to generation workers
      
      * abandon complete async due to interactions with the tensorflow threadpool
      
      * cleanup
      
      * remove performance_comparison.py
      
      * experiment with rough generator + interleave pipeline
      
      * yet more pipeline tuning
      
      * update on-the-fly pipeline
      
      * refactor preprocessing, and move train generation behind a GRPC server
      
      * fix leftover call
      
      * intermediate commit
      
      * intermediate commit
      
      * fix index error in data pipeline, and add logging to train data server
      
      * make sharding more robust to imbalance
      
      * correctly sample with replacement
      
      * file buffers are no longer needed for this branch
      
      * tweak sampling methods
      
      * add README for data pipeline
      
      * fix eval sampling, and vectorize eval metrics
      
      * add spillover and static training batch sizes
      
      * clean up cruft from earlier iterations
      
      * rough delint
      
      * delint 2 / n
      
      * add type annotations
      
      * update run script
      
      * make run.sh a bit nicer
      
      * change embedding initializer to match reference
      
      * rough pass at pure estimator model_fn
      
      * impose static shape hack (revisit later)
      
      * refinements
      
      * fix dir error in run.sh
      
      * add documentation
      
      * add more docs and fix an assert
      
      * old data test is no longer valid. Keeping it around as reference for the new one
      
      * rough draft of data pipeline validation script
      
      * don't rely on shuffle default
      
      * tweaks and documentation
      
      * add separate eval batch size for performance
      
      * initial commit
      
      * terrible hacking
      
      * mini hacks
      
      * missed a bug
      
      * messing about trying to get TPU running
      
      * TFRecords based TPU attempt
      
      * bug fixes
      
      * don't log remotely
      
      * more bug fixes
      
      * TPU tweaks and bug fixes
      
      * more tweaks
      
      * more adjustments
      
      * rework model definition
      
      * tweak data pipeline
      
      * refactor async TFRecords generation
      
      * temp commit to run.sh
      
      * update log behavior
      
      * fix logging bug
      
      * add check for subprocess start to avoid cryptic hangs
      
      * unify deserialize and make it TPU compliant
      
      * delint
      
      * remove gRPC pipeline code
      
      * fix logging bug
      
      * delint and remove old test files
      
      * add unit tests for NCF pipeline
      
      * delint
      
      * clean up run.sh, and add run_tpu.sh
      
      * forgot the most important line
      
      * fix run.sh bugs
      
      * yet more bash debugging
      
      * small tweak to add keras summaries to model_fn
      
      * Clean up sixification issues
      
      * address PR comments
      
      * delinting is never over
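
      One detail from this list worth spelling out is having pool workers ignore SIGINT (the "workers ignore SIGINT" item above), so Ctrl-C reaches only the parent, which then shuts the pool down cleanly. A common recipe, not the repository's exact code:

      ```python
      import multiprocessing
      import signal


      def _init_worker():
        # Workers inherit the parent's SIGINT handler; ignoring it here means
        # Ctrl-C interrupts only the parent, which owns the shutdown.
        signal.signal(signal.SIGINT, signal.SIG_IGN)


      def _square(x):
        return x * x


      if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=8, initializer=_init_worker)
        try:
          print(pool.map(_square, range(10)))
        except KeyboardInterrupt:
          pool.terminate()
        else:
          pool.close()
        pool.join()
      ```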
    • Compute metrics under distributed strategies. (#4942) · a88b89be
      Sundara Tejaswi Digumarti authored
      Removed the conditional over distributed strategies when computing metrics.
      Metrics are now computed even when distributed strategies are used.
  22. 26 Jul, 2018 1 commit
    • fix batch_size in transformer_main.py (#4897) · 2d7a0d6a
      Jiang Yu authored
      * fix batch_size in transformer_main.py
      
      Fix batch_size in transformer_main.py, which caused a ResourceExhaustedError (OOM) when training Transformer models with models/official/transformer.
      
      * small format change
      
      Change the formatting from one line to multiple lines in order to pass lint tests.
      
      * remove trailing space and add comment
  23. 21 Jul, 2018 1 commit
  24. 20 Jul, 2018 1 commit
    • Add eager for keras benchmark (#4825) · 2689c9ae
      Yanhui Liang authored
      * Add more arguments
      
      * Add eager mode (see the sketch after this commit)
      
      * Add notes for eager mode
      
      * Address the comments
      
      * Fix argument typos
      
      * Add warning for eager and multi-gpu
      
      * Fix typo
      
      * Fix notes
      
      * Fix pylint
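
      In the TF 1.x codebase these benchmarks target, eager mode is switched on once at program start; a minimal sketch, with the surrounding flag wiring omitted:

      ```python
      import tensorflow as tf

      # Must run before any other TensorFlow call in the process.
      tf.enable_eager_execution()

      assert tf.executing_eagerly()

      # Note: in TF 1.x, eager mode does not compose with the multi-GPU
      # paths, hence the warning this commit adds.
      ```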