Commits · cc0ad1cb9f4c81373dd52ebf9cf87eeda81eaba5 · ModelZoo / ResNet50_tensorflow

20 Dec, 2018 1 commit
- Avoid using tf.contrib.data as it's not tf2-safe (#5755) · cc0ad1cb
  Alexandre Passos authored Dec 19, 2018
  
  cc0ad1cb
07 Nov, 2018 1 commit
- Fix PREPROC_HP_NUM_EVAL tag for MLPerf. (#5717) · d7ce21fa
  Reed authored Nov 07, 2018
```
This tag should match EVAL_HP_NUM_NEG.
```
  d7ce21fa
03 Nov, 2018 1 commit

Have async process end when all data is written. (#5652) · 424fe9f6

Reed authored Nov 02, 2018

I've noticed sometimes the async process's pool processes do not die when ncf_main.py ends and kills the async process. This commit fixes the issue.

424fe9f6

01 Nov, 2018 1 commit
- Add --use_while_loop option. (#5653) · 826eea75
  Reed authored Nov 01, 2018
  
  826eea75
30 Oct, 2018 3 commits
- bring NCF to l2 logging compliance (#5642) · 82e783e3
  Taylor Robie authored Oct 30, 2018
  
  82e783e3
- Keras-ify NCF TPU embedding lookup (#5641) · 8a15a4df
  Taylor Robie authored Oct 30, 2018
```
* Keras-ify TPU embedding lookup

* delint

* pull get_variable() out of keras lambda

* delint

* move get_variable under variable scope
```
  8a15a4df
- Merges TPU-TC optimizations into HEAD. (#5635) · b8318fd3
  Tayo Oguntebi authored Oct 29, 2018
```
* Merges TPU-TC optimizations into HEAD.

* Split a line that went over 80 from a tab.

* Remove trailing whitespace.
```
  b8318fd3
29 Oct, 2018 1 commit
- Add option to not use estimator. (#5623) · 0c0860ed
  Reed authored Oct 29, 2018
```
The option is --nouse_estimator
```
  0c0860ed
26 Oct, 2018 1 commit

Split --ml_perf into two flags. (#5615) · 4298c3a3

Reed authored Oct 26, 2018

--ml_perf now just changes the model to make it MLPerf compliant. --output_ml_perf_compliance_logging adds the MLPerf compliance logs.

4298c3a3

25 Oct, 2018 2 commits

prevent async process from writing alive file until the main process has... · 2644707c
Taylor Robie authored Oct 25, 2018
```
prevent async process from writing alive file until the main process has created the cache root (#5614)
```
2644707c

Fix crash when --ml_perf flag is not specified. (#5610) · 48a4b443

Reed authored Oct 25, 2018

The error message was:

absl.flags._exceptions.IllegalFlagValueError: flag --ml_perf=None: ('Non-boolean argument to boolean flag', 'None')

48a4b443

24 Oct, 2018 1 commit

Add logging calls to NCF (#5576) · 780f5265

Taylor Robie authored Oct 24, 2018

* first pass at __getattr__ abuse logger

* first pass at adding tags to NCF

* minor formatting updates

* fix tag name

* convert metrics to python floats

* getting closer...

* direct mlperf logs to a file

* small tweaks and add stitching

* update tags

* fix tag and add a sudo call

* tweak format of run.sh

* delint

* use distribution strategies for evaluation

* address PR comments

* delint and fix test

* adjust flag validation for xla

* add prefix to distinguish log stitching

* fix index bug

* fix clear cache for root user

* dockerize cache drop

* TIL some regex magic

780f5265

20 Oct, 2018 1 commit
- Add XLA support to NCF (#5572) · f2b702a0
  Reed authored Oct 19, 2018
  
  f2b702a0
19 Oct, 2018 1 commit
- fix error when last shard is not assigned a batch (#5569) · bf298439
  Taylor Robie authored Oct 18, 2018
  
  bf298439
18 Oct, 2018 2 commits

Reorder NCF data pipeline (#5536) · 19d4eaaf

Taylor Robie authored Oct 18, 2018

* intermediate commit

finish replacing spillover with resampled padding

intermediate commit

* resolve merge conflict

* intermediate commit

* further consolidate the data pipeline

* complete first pass at data pipeline refactor

* remove some leftover code

* fix test

* remove resampling, and move train padding logic into neumf.py

* small tweaks

* fix weight bug

* address PR comments

* fix dict zip. (Reed led me astray)

* delint

* make data test deterministic and delint

* Reed didn't lead me astray. I just can't read.

* more delinting

* even more delinting

* use resampling for last batch padding

* pad last batch with unique data

* Revert "pad last batch with unique data"

This reverts commit cbdf46efcd5c7907038a24105b88d38e7f1d6da2.

* move padded batch to the beginning

* delint

* fix step check for synthetic data

19d4eaaf

Delint. · 3ec25e5d
Shawn Wang authored Oct 17, 2018

3ec25e5d

17 Oct, 2018 2 commits
- Fix a few imports. · f9742f43
  Shawn Wang authored Oct 17, 2018
  
  f9742f43
- Refactor neumf_model.py to support users who just need top_k and ndcg tensors. · 91000bc5
  Shawn Wang authored Oct 17, 2018
  
  91000bc5
14 Oct, 2018 1 commit
- Make flagfile sharing robust to distributed filesystems and multi-worker setups. (#5521) · 91b2debd
  Taylor Robie authored Oct 14, 2018
```
* move flagfile into the cache_dir

* remove duplicate code

* delint
```
  91b2debd
13 Oct, 2018 1 commit

Replace multiprocess pool with popen_helper.get_pool() in data_preprocessing. (#5512) · 0c5c3a77

shizhiw authored Oct 12, 2018

* Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.

* Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.

* Replace multiprocess pool with popen_helper.get_pool() in data_preprocessing.

0c5c3a77

11 Oct, 2018 5 commits
- Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py. (#5506) · b88da6ee
  shizhiw authored Oct 11, 2018
```
* Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.

* Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.
```
  b88da6ee
- Add comments, exit async process after waiting for flagfile for too long and... · 1980a0da
  Shawn Wang authored Oct 11, 2018
```
Add comments, exit async process after waiting for flagfile for too long and make directory for data_dir in case it does not exist.
```
  1980a0da
- Use flagfile to pass flags to data async generation process: small fix. · 5d497296
  Shawn Wang authored Oct 11, 2018
  
  5d497296
- Use flagfile to pass flags to data async generation process. · c88fcb2b
  Shawn Wang authored Oct 11, 2018
  
  c88fcb2b
- Added option to use_subprocess or not in ncf_main.py. · d4ac494f
  Shawn Wang authored Oct 11, 2018
  
  d4ac494f
10 Oct, 2018 2 commits
- Improve perf by converting sparse grads to dense. (#5470) · ad254209
  Reed authored Oct 10, 2018
  
  ad254209
- Add --use_synthetic_data option to NCF. (#5468) · 75d592e9
  Reed authored Oct 10, 2018
```
* Add --use_synthetic_data option to NCF.

* Add comment to _SYNTHETIC_BATCHES_PER_EPOCH

* Fix test

* Hopefully fix lint issue
```
  75d592e9
09 Oct, 2018 2 commits
- fixed a missing import. · a45cafb3
  Shawn Wang authored Oct 09, 2018
  
  a45cafb3
- Allow data async generation to be run as a separate job rather than as a subprocess. · 9b7e4163
  Shawn Wang authored Oct 09, 2018
  
  9b7e4163
05 Oct, 2018 1 commit

Fix/ncf eval default (#5438) · aec1fec6

Taylor Robie authored Oct 04, 2018

* improve default handling for eval_batch_size

* return eval_batch_size default to None

* fix syntax error

aec1fec6

03 Oct, 2018 1 commit

Move evaluation to .evaluate() (#5413) · c494582f

Taylor Robie authored Oct 02, 2018

* move evaluation from numpy to tensorflow

fix syntax error

don't use sigmoid to convert logits. there is too much precision loss.

WIP: add logit metrics

continue refactor of NCF evaluation

fix syntax error

fix bugs in eval loss calculation

fix eval loss reweighting

remove numpy based metric calculations

fix logging hooks

fix sigmoid to softmax bug

fix comment

catch rare PIPE error and address some PR comments

* fix metric test and address PR comments

* delint and fix python2

* fix test and address PR comments

* extend eval to TPUs

c494582f
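
The note in this commit about not using sigmoid to convert logits can be illustrated with a small sketch. It is an illustration of floating-point saturation in general, not the code from the commit: the helper name `sigmoid` and the logit values are assumptions chosen for the example. For large logits the sigmoid rounds to exactly 1.0, so two items that are clearly ordered by logit become indistinguishable after conversion. The example uses Python's double-precision floats; single precision saturates sooner.

```python
import math


def sigmoid(x):
    """Plain logistic function, used only to demonstrate the rounding effect."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for two candidate items; the second clearly ranks higher.
logit_a, logit_b = 40.0, 50.0

# After the sigmoid both values round to 1.0, so the ordering is lost.
prob_a, prob_b = sigmoid(logit_a), sigmoid(logit_b)
ranking_preserved_by_logits = logit_a < logit_b
ranking_preserved_by_probs = prob_a < prob_b
```

Ranking directly on the logits keeps the ordering that the converted probabilities no longer carry.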

02 Oct, 2018 1 commit
- Add flags for adam hyperparameters (#5428) · f3be93a7
  Reed authored Oct 02, 2018
  
  f3be93a7
20 Sep, 2018 1 commit

Fix/ncf mlperf tweaks: robustness and determinism (#5334) · 4dc1080d

Taylor Robie authored Sep 19, 2018

* bug fixes and add seed

* more random corrections

* make cleanup more robust

* return cleanup fn

* delint and address PR comments.

* delint and fix tests

* delinting is never done

* add pipeline hashing

* delint

4dc1080d

14 Sep, 2018 1 commit

Wait longer for async process to spawn. (#5307) · 17fa5286

Reed authored Sep 13, 2018

Sometimes it takes longer than 15 seconds, and even longer than 1 minute, to spawn and create the alive file.

17fa5286
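
Waiting for a marker file such as the alive file is typically done by polling with a deadline. The following is a minimal sketch of that pattern, not the code from this commit: the function name `wait_for_file` and the parameters `timeout_s` and `poll_interval_s` are hypothetical, and the default timeout is an arbitrary illustrative value.

```python
import os
import time


def wait_for_file(path, timeout_s=300.0, poll_interval_s=0.1):
    """Return True once `path` exists, or False if `timeout_s` elapses first."""
    # Use a monotonic clock so wall-clock adjustments cannot shorten the wait.
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A caller can then raise a descriptive error only when the function returns False, which keeps the timeout value in one place and easy to raise if spawning turns out to be slow.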

11 Sep, 2018 1 commit
- Fix race condition with ready file. (#5271) · 34beb7ad
  Reed authored Sep 11, 2018
  
  34beb7ad
05 Sep, 2018 2 commits

Fix spurious "did not start correctly" error. (#5252) · 7babedc5

Reed authored Sep 05, 2018

* Fix spurious "did not start correctly" error.

The error "Generation subprocess did not start correctly" would occur if the async process started up after the main process checked for the subproc_alive file.

* Add error message

7babedc5

Fix crash caused by race in the async process. (#5250) · 5856878d

Reed authored Sep 05, 2018

When constructing the evaluation records, data_async_generation.py would copy the records into the final directory. The main process would wait until the eval records existed. However, the main process would sometimes read the eval records before they were fully copied, causing a DataLossError.

5856878d
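
One common way to avoid this kind of race is to copy into a temporary file and rename it into place only when the copy is complete, so a reader either sees no file or a whole file. The sketch below shows that pattern under the assumption of a local POSIX-style filesystem; it is not necessarily how the commit fixes the issue, and the function name `publish_atomically` is hypothetical.

```python
import os
import shutil
import tempfile


def publish_atomically(src_path, final_path):
    """Copy `src_path` to `final_path` so readers never see a partial file."""
    final_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file in the destination directory so the final
    # rename stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file, open(src_path, "rb") as src_file:
            shutil.copyfileobj(src_file, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # The rename makes the complete file visible in a single step.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach a process that waits for `final_path` to exist can read it as soon as it appears.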

22 Aug, 2018 1 commit

Fix convergence issues for MLPerf. (#5161) · 64710c05

Reed authored Aug 22, 2018

* Fix convergence issues for MLPerf.

Thank you to @robieta for helping me find these issues, and for providing an algorithm for the `get_hit_rate_and_ndcg_mlperf` function.

This change causes every forked process to set a new seed, so that forked processes do not generate the same set of random numbers. This improves evaluation hit rates.

Additionally, it adds a flag, --ml_perf, that makes further changes so that the evaluation hit rate can match the MLPerf reference implementation.

I ran 4 times with --ml_perf and 4 times without. Without --ml_perf, the highest hit rates achieved by each run were 0.6278, 0.6287, 0.6289, and 0.6241. With --ml_perf, the highest hit rates were 0.6353, 0.6356, 0.6367, and 0.6353.

* fix lint error

* Fix failing test

* Address @robieta's feedback

* Address more feedback

64710c05

18 Aug, 2018 1 commit

Speed up cache construction. (#5131) · 5aee67b4

Reed authored Aug 17, 2018

This is done by using a higher Pickle protocol version, which the Python docs describe as being "slightly more efficient". This reduces the file write time at the beginning from 2 1/2 minutes to 5 seconds.

5aee67b4
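
Selecting a higher pickle protocol is a one-argument change. The sketch below assumes the newest protocol available to the interpreter is acceptable; the commit message does not say which protocol version was chosen, and the helper names `dump_cache` and `load_cache` are hypothetical.

```python
import pickle


def dump_cache(obj, path):
    """Write `obj` to `path` using the newest pickle protocol available."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_cache(path):
    """Read back an object written by `dump_cache`."""
    with open(path, "rb") as f:
        # The protocol is detected automatically when loading.
        return pickle.load(f)
```

Files written with a newer protocol can only be read by Python versions that support it, which is the main trade-off of this choice.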

02 Aug, 2018 1 commit
- Fix docstrings in data_preprocessing.py. (#4976) · f0e10716
  Reed authored Aug 02, 2018
  
  f0e10716