1. 05 Sep, 2018 2 commits
    • Fix spurious "did not start correctly" error. (#5252) · 7babedc5
      Reed authored
      * Fix spurious "did not start correctly" error.
      
      The error "Generation subprocess did not start correctly" would occur if the async process started up after the main process checked for the subproc_alive file.
      
      * Add error message
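
      A minimal sketch of one way to avoid this kind of race: poll for the sentinel with a deadline instead of checking once. The `subproc_alive` file name comes from the commit message; the helper name and timeout values are illustrative:

      ```python
      import os
      import time


      def wait_for_subprocess(alive_file, timeout_sec=300, poll_interval_sec=1):
        """Poll for the async generator's sentinel file instead of checking once.

        A single check races with the subprocess, which may simply not have
        written the file yet. Polling with a deadline distinguishes a slow
        start from a genuine failure.
        """
        deadline = time.time() + timeout_sec
        while time.time() < deadline:
          if os.path.exists(alive_file):
            return  # Subprocess started correctly.
          time.sleep(poll_interval_sec)
        raise RuntimeError(
            "Generation subprocess did not start correctly: %s never appeared."
            % alive_file)
      ```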
    • Fix crash caused by race in the async process. (#5250) · 5856878d
      Reed authored
      When constructing the evaluation records, data_async_generation.py would copy the records into the final directory. The main process would wait until the eval records existed. However, the main process would sometimes read the eval records before they were fully copied, causing a DataLossError.
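
      A common fix for this read-before-copy-completes race is to write to a temporary name in the destination directory and then atomically rename into place, so a reader never observes a half-written file. A minimal sketch assuming a POSIX filesystem; the helper name and paths are illustrative, not the repository's actual code:

      ```python
      import os
      import shutil
      import tempfile


      def atomic_copy(src, dst):
        """Copy src to dst so dst never exists in a partially written state."""
        dst_dir = os.path.dirname(dst) or "."
        fd, tmp_path = tempfile.mkstemp(dir=dst_dir)
        os.close(fd)
        shutil.copyfile(src, tmp_path)
        # os.rename is atomic on POSIX when source and target are on the same
        # filesystem, so a waiting reader sees either no file or the whole file.
        os.rename(tmp_path, dst)
      ```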
  2. 04 Sep, 2018 1 commit
  3. 02 Sep, 2018 2 commits
  4. 01 Sep, 2018 2 commits
  5. 30 Aug, 2018 1 commit
  6. 29 Aug, 2018 1 commit
  7. 28 Aug, 2018 2 commits
  8. 27 Aug, 2018 2 commits
    • ResNet eval_only mode (#5186) · d1c48afc
      Taylor Robie authored
      * Make ResNet robust to the case where epochs_between_evals does not divide train_epochs, and add an --eval_only option (see the sketch after this commit)
      
      * add some comments to make the control flow easier to follow
      
      * address PR comments
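
      A sketch of the scheduling logic such a change implies: split train_epochs into evaluation cycles, letting the final cycle be shorter when epochs_between_evals does not divide evenly. The function name is illustrative:

      ```python
      import math


      def build_epoch_schedule(train_epochs, epochs_between_evals,
                               eval_only=False):
        """Return how many training epochs to run before each evaluation."""
        if eval_only:
          return [0]  # Skip training entirely and run a single evaluation.
        n_cycles = int(math.ceil(train_epochs / float(epochs_between_evals)))
        schedule = [epochs_between_evals] * n_cycles
        # Shrink the final cycle so the total matches train_epochs exactly.
        schedule[-1] = train_epochs - epochs_between_evals * (n_cycles - 1)
        return schedule


      # e.g. train_epochs=10, epochs_between_evals=4 -> [4, 4, 2]
      ```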
    • Add 5 epoch warmup to resnet (#5176) · 9bf586de
      Toby Boyd authored
      * Add a 5-epoch learning-rate warmup (see the sketch after this commit)
      
      * get_lr with warm_up only for imagenet
      
      * Add base_lr, remove fp16 unittest arg validation
      
      * Remove validation check stopping v1 and FP16
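
      A sketch of a linear warmup ramp of the kind this commit describes: scale from zero up to base_lr over the first five epochs, then hand off to the usual step-decay schedule. This is a plain-Python illustration, not the repository's TensorFlow graph code:

      ```python
      def learning_rate_with_warmup(global_step, steps_per_epoch, base_lr,
                                    warmup_epochs=5):
        """Ramp linearly from 0 to base_lr over the first warmup_epochs."""
        warmup_steps = warmup_epochs * steps_per_epoch
        if global_step < warmup_steps:
          return base_lr * float(global_step) / warmup_steps
        return base_lr  # After warmup, the normal decay schedule takes over.
      ```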
  9. 25 Aug, 2018 1 commit
  10. 22 Aug, 2018 1 commit
    • Fix convergence issues for MLPerf. (#5161) · 64710c05
      Reed authored
      * Fix convergence issues for MLPerf.
      
      Thank you to @robieta for helping me find these issues, and for providing an algorithm for the `get_hit_rate_and_ndcg_mlperf` function.
      
      This change causes every forked process to set a new seed, so that forked processes do not generate the same set of random numbers. This improves evaluation hit rates (see the sketch after this commit).
      
      Additionally, it adds a flag, --ml_perf, that makes further changes so that the evaluation hit rate can match the MLPerf reference implementation.
      
      I ran 4 times with --ml_perf and 4 times without. Without --ml_perf, the highest hit rates achieved by each run were 0.6278, 0.6287, 0.6289, and 0.6241. With --ml_perf, the highest hit rates were 0.6353, 0.6356, 0.6367, and 0.6353.
      
      * fix lint error
      
      * Fix failing test
      
      * Address @robieta's feedback
      
      * Address more feedback
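
      A sketch of the per-worker reseeding described above: give each forked process a fresh seed in a multiprocessing pool initializer so the workers draw different random streams. The initializer is illustrative, not the actual NCF code:

      ```python
      import multiprocessing
      import os
      import struct

      import numpy as np


      def _reseed_worker():
        # Forked workers inherit the parent's NumPy RNG state; without
        # reseeding, every worker would generate the same "random" negatives.
        np.random.seed(struct.unpack("I", os.urandom(4))[0])


      if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=4, initializer=_reseed_worker)
      ```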
  11. 20 Aug, 2018 1 commit
  12. 18 Aug, 2018 1 commit
    • Speed up cache construction. (#5131) · 5aee67b4
      Reed authored
      This is done by using a higher Pickle protocol version, which the Python docs describe as being "slightly more efficient". This reduces the file write time at the beginning from 2 1/2 minutes to 5 seconds.
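
      The change amounts to passing a newer protocol to pickle.dump; a minimal illustration with placeholder data:

      ```python
      import pickle

      data = {"user_ids": list(range(1000000))}

      with open("cache.pkl", "wb") as f:
        # Python 2 defaults to the slow, human-readable protocol 0;
        # HIGHEST_PROTOCOL selects a compact binary encoding instead.
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
      ```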
  13. 16 Aug, 2018 2 commits
  14. 15 Aug, 2018 1 commit
  15. 14 Aug, 2018 2 commits
    • Transformer partial fix (#5092) · 6f5967a0
      alope107 authored
      * Fix Transformer TPU crash in Python 2.X.
      
      - Tensorflow raises an error when tf_inspect.getfullargspec is called on a functools.partial in Python 2.X. This issue would be hit during the eval stage of the Transformer TPU model. This change replaces the call to functools.partial with a lambda to work around the issue (see the sketch after this commit).
      
      * Remove unused import from transformer_main.
      
      * Fix lint error.
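
      A sketch of the workaround, with illustrative names rather than the actual Transformer code:

      ```python
      import functools


      def model_fn(features, labels, mode, params):
        """Stand-in for the Estimator model_fn."""
        del features, labels, mode, params


      params = {"hidden_size": 512}

      # Python 2's inspect module cannot take the argspec of a
      # functools.partial, which tf_inspect.getfullargspec relies on,
      # so this form crashes during eval:
      partial_fn = functools.partial(model_fn, params=params)

      # A lambda is an ordinary function with a real argspec, so it
      # inspects cleanly:
      lambda_fn = lambda features, labels, mode: model_fn(
          features, labels, mode, params=params)
      ```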
    • Resnet transfer learning (#5047) · 7bffd37b
      Zac Wellmer authored
      * Warm start a ResNet with all but the dense layer, and update only the final layer's weights when fine-tuning (see the sketch after this commit)
      
      * Update README for Transfer Learning
      
      * Make lint happy and fix a variable naming error related to scaled gradients
      
      * edit the test cases for cifar10 and imagenet to reflect the default case of no fine tuning
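
      A sketch of how such a warm-start-and-fine-tune split is typically wired up with tf.estimator in TF 1.x; the checkpoint path is a placeholder and the "dense" name filter is an assumption about how the final layer is named:

      ```python
      import tensorflow as tf

      # Warm-start every variable except those of the final dense layer.
      warm_start = tf.estimator.WarmStartSettings(
          ckpt_to_initialize_from="/path/to/pretrained_resnet",
          vars_to_warm_start="^(?!.*dense)")


      def _dense_grad_filter(grads_and_vars):
        """When fine-tuning, keep only the final layer's gradients."""
        return [(g, v) for g, v in grads_and_vars if "dense" in v.name]
      ```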
  16. 13 Aug, 2018 1 commit
  17. 10 Aug, 2018 1 commit
  18. 02 Aug, 2018 2 commits
  19. 01 Aug, 2018 1 commit
  20. 31 Jul, 2018 8 commits
  21. 30 Jul, 2018 2 commits
    • NCF pipeline refactor (take 2) and initial TPU port. (#4935) · 6518c1c7
      Taylor Robie authored
      * intermediate commit
      
      * ncf now working
      
      * reorder pipeline
      
      * allow batched decode for file backed dataset
      
      * fix bug
      
      * more tweaks
      
      * parallelize false negative generation
      
      * shared pool hack
      
      * workers ignore SIGINT (see the sketch after this commit)
      
      * intermediate commit
      
      * simplify buffer backed dataset creation to fixed length record approach only. (more cleanup needed)
      
      * more tweaks
      
      * simplify pipeline
      
      * fix misplaced cleanup() calls. (validation works!)
      
      * more tweaks
      
      * sixify memoryview usage
      
      * more sixification
      
      * fix bug
      
      * add future imports
      
      * break up training input pipeline
      
      * more pipeline tuning
      
      * first pass at moving negative generation to async
      
      * refactor async pipeline to use files instead of ipc
      
      * refactor async pipeline
      
      * move expansion and concatenation from reduce worker to generation workers
      
      * abandon complete async due to interactions with the tensorflow threadpool
      
      * cleanup
      
      * remove performance_comparison.py
      
      * experiment with rough generator + interleave pipeline
      
      * yet more pipeline tuning
      
      * update on-the-fly pipeline
      
      * refactor preprocessing, and move train generation behind a GRPC server
      
      * fix leftover call
      
      * intermediate commit
      
      * intermediate commit
      
      * fix index error in data pipeline, and add logging to train data server
      
      * make sharding more robust to imbalance
      
      * correctly sample with replacement
      
      * file buffers are no longer needed for this branch
      
      * tweak sampling methods
      
      * add README for data pipeline
      
      * fix eval sampling, and vectorize eval metrics
      
      * add spillover and static training batch sizes
      
      * clean up cruft from earlier iterations
      
      * rough delint
      
      * delint 2 / n
      
      * add type annotations
      
      * update run script
      
      * make run.sh a bit nicer
      
      * change embedding initializer to match reference
      
      * rough pass at pure estimator model_fn
      
      * impose static shape hack (revisit later)
      
      * refinements
      
      * fix dir error in run.sh
      
      * add documentation
      
      * add more docs and fix an assert
      
      * old data test is no longer valid. Keeping it around as reference for the new one
      
      * rough draft of data pipeline validation script
      
      * don't rely on shuffle default
      
      * tweaks and documentation
      
      * add separate eval batch size for performance
      
      * initial commit
      
      * terrible hacking
      
      * mini hacks
      
      * missed a bug
      
      * messing about trying to get TPU running
      
      * TFRecords based TPU attempt
      
      * bug fixes
      
      * don't log remotely
      
      * more bug fixes
      
      * TPU tweaks and bug fixes
      
      * more tweaks
      
      * more adjustments
      
      * rework model definition
      
      * tweak data pipeline
      
      * refactor async TFRecords generation
      
      * temp commit to run.sh
      
      * update log behavior
      
      * fix logging bug
      
      * add check for subprocess start to avoid cryptic hangs
      
      * unify deserialize and make it TPU compliant
      
      * delint
      
      * remove gRPC pipeline code
      
      * fix logging bug
      
      * delint and remove old test files
      
      * add unit tests for NCF pipeline
      
      * delint
      
      * clean up run.sh, and add run_tpu.sh
      
      * forgot the most important line
      
      * fix run.sh bugs
      
      * yet more bash debugging
      
      * small tweak to add keras summaries to model_fn
      
      * Clean up sixification issues
      
      * address PR comments
      
      * delinting is never over
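
      One detail from this list worth spelling out is having pool workers ignore SIGINT (the "workers ignore SIGINT" item above), so Ctrl-C reaches only the parent, which then shuts the pool down cleanly. A common recipe, not the repository's exact code:

      ```python
      import multiprocessing
      import signal


      def _init_worker():
        # Workers inherit the parent's SIGINT handler; ignoring it here means
        # Ctrl-C interrupts only the parent, which owns the shutdown.
        signal.signal(signal.SIGINT, signal.SIG_IGN)


      def _square(x):
        return x * x


      if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=8, initializer=_init_worker)
        try:
          print(pool.map(_square, range(10)))
        except KeyboardInterrupt:
          pool.terminate()
        else:
          pool.close()
        pool.join()
      ```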
    • Compute metrics under distributed strategies. (#4942) · a88b89be
      Sundara Tejaswi Digumarti authored
      Removed the conditional over distributed strategies when computing metrics.
      Metrics are now computed even when distributed strategies are used.
  22. 26 Jul, 2018 1 commit
    • fix batch_size in transformer_main.py (#4897) · 2d7a0d6a
      Jiang Yu authored
      * fix batch_size in transformer_main.py
      
      Fix batch_size in transformer_main.py, which caused a ResourceExhaustedError (OOM) when training Transformer models with models/official/transformer.
      
      * small format change
      
      Change the formatting from one line to multiple lines in order to pass lint tests.
      
      * remove trailing space and add comment
  23. 21 Jul, 2018 1 commit
  24. 20 Jul, 2018 1 commit
    • Add eager for keras benchmark (#4825) · 2689c9ae
      Yanhui Liang authored
      * Add more arguments
      
      * Add eager mode (see the sketch after this commit)
      
      * Add notes for eager mode
      
      * Address the comments
      
      * Fix argument typos
      
      * Add warning for eager and multi-gpu
      
      * Fix typo
      
      * Fix notes
      
      * Fix pylint
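
      In the TF 1.x codebase these benchmarks target, eager mode is switched on once at program start; a minimal sketch, with the surrounding flag wiring omitted:

      ```python
      import tensorflow as tf

      # Must run before any other TensorFlow call in the process.
      tf.enable_eager_execution()

      assert tf.executing_eagerly()

      # Note: in TF 1.x, eager mode does not compose with the multi-GPU
      # paths, hence the warning this commit adds.
      ```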