Commits · 2644707cd388f5a791a04dc41fe5fdc77a55a6a4 · ModelZoo / ResNet50_tensorflow

25 Oct, 2018 1 commit
- prevent async process from writing alive file until the main process has... · 2644707c
  Taylor Robie authored Oct 25, 2018
```
prevent async process from writing alive file until the main process has created the cache root (#5614)
```
24 Oct, 2018 1 commit

Add logging calls to NCF (#5576) · 780f5265

Taylor Robie authored Oct 24, 2018

* first pass at __getattr__ abuse logger

* first pass at adding tags to NCF

* minor formatting updates

* fix tag name

* convert metrics to python floats

* getting closer...

* direct mlperf logs to a file

* small tweaks and add stitching

* update tags

* fix tag and add a sudo call

* tweak format of run.sh

* delint

* use distribution strategies for evaluation

* address PR comments

* delint and fix test

* adjust flag validation for xla

* add prefix to distinguish log stitching

* fix index bug

* fix clear cache for root user

* dockerize cache drop

* TIL some regex magic

19 Oct, 2018 1 commit
- fix error when last shard is not assigned a batch (#5569) · bf298439
  Taylor Robie authored Oct 18, 2018
  
18 Oct, 2018 1 commit

Reorder NCF data pipeline (#5536) · 19d4eaaf

Taylor Robie authored Oct 18, 2018

* intermediate commit

finish replacing spillover with resampled padding

intermediate commit

* resolve merge conflict

* intermediate commit

* further consolidate the data pipeline

* complete first pass at data pipeline refactor

* remove some leftover code

* fix test

* remove resampling, and move train padding logic into neumf.py

* small tweaks

* fix weight bug

* address PR comments

* fix dict zip. (Reed led me astray)

* delint

* make data test deterministic and delint

* Reed didn't lead me astray. I just can't read.

* more delinting

* even more delinting

* use resampling for last batch padding

* pad last batch with unique data

* Revert "pad last batch with unique data"

This reverts commit cbdf46efcd5c7907038a24105b88d38e7f1d6da2.

* move padded batch to the beginning

* delint

* fix step check for synthetic data

14 Oct, 2018 1 commit
- Make flagfile sharing robust to distributed filesystems and multi-worker setups. (#5521) · 91b2debd
  Taylor Robie authored Oct 14, 2018
```
* move flagfile into the cache_dir

* remove duplicate code

* delint
```
11 Oct, 2018 2 commits
- Add comments, exit async process after waiting for flagfile for too long and... · 1980a0da
  Shawn Wang authored Oct 11, 2018
```
Add comments, exit async process after waiting for flagfile for too long and make directory for data_dir in case it does not exist.
```
- Use flagfile to pass flags to data async generation process. · c88fcb2b
  Shawn Wang authored Oct 11, 2018
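
  The two commits above describe the async process waiting for a flagfile, giving up after too long, and creating data_dir when it is missing. A minimal sketch of that behaviour follows. The function names, timeout values, and poll interval are assumptions for illustration and are not taken from the repository.

```python
import os
import time


def wait_for_flagfile(path, timeout_s=5.0, poll_s=0.05):
    """Return True once `path` exists, or False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(path)


def prepare(data_dir, flagfile, timeout_s=5.0):
    """Hypothetical start-up step for the async process."""
    # Create data_dir if it does not exist; exist_ok avoids an error if it does.
    os.makedirs(data_dir, exist_ok=True)
    if not wait_for_flagfile(flagfile, timeout_s=timeout_s):
        # Exit rather than wait forever for a main process that may be gone.
        raise SystemExit(
            "flagfile {} did not appear within {}s".format(flagfile, timeout_s))
```

  Polling against a monotonic deadline keeps the wait bounded even if the system clock is adjusted while the process is waiting.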
  
09 Oct, 2018 2 commits
- fixed a missing import. · a45cafb3
  Shawn Wang authored Oct 09, 2018
  
- Allow data async generation to be run as a separate job rather than as a subprocess. · 9b7e4163
  Shawn Wang authored Oct 09, 2018
  
03 Oct, 2018 1 commit

Move evaluation to .evaluate() (#5413) · c494582f

Taylor Robie authored Oct 02, 2018

* move evaluation from numpy to tensorflow

fix syntax error

don't use sigmoid to convert logits. there is too much precision loss.

WIP: add logit metrics

continue refactor of NCF evaluation

fix syntax error

fix bugs in eval loss calculation

fix eval loss reweighting

remove numpy based metric calculations

fix logging hooks

fix sigmoid to softmax bug

fix comment

catch rare PIPE error and address some PR comments

* fix metric test and address PR comments

* delint and fix python2

* fix test and address PR comments

* extend eval to TPUs

20 Sep, 2018 1 commit

Fix/ncf mlperf tweaks: robustness and determinism (#5334) · 4dc1080d

Taylor Robie authored Sep 19, 2018

* bug fixes and add seed

* more random corrections

* make cleanup more robust

* return cleanup fn

* delint and address PR comments.

* delint and fix tests

* delinting is never done

* add pipeline hashing

* delint

11 Sep, 2018 1 commit
- Fix race condition with ready file. (#5271) · 34beb7ad
  Reed authored Sep 11, 2018
  
05 Sep, 2018 1 commit

Fix crash caused by race in the async process. (#5250) · 5856878d

Reed authored Sep 05, 2018

When constructing the evaluation records, data_async_generation.py would copy the records into the final directory. The main process would wait until the eval records existed. However, the main process would sometimes read the eval records before they were fully copied, causing a DataLossError.
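
The commit message does not say how the race was closed. One common way to keep a reader from seeing a half-copied file is to write to a temporary name in the destination directory and rename it into place once it is complete. The sketch below shows that pattern; `write_atomically` is a hypothetical helper, not a function from data_async_generation.py.

```python
import os
import tempfile


def write_atomically(final_path, payload):
    """Write `payload` (bytes) so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # The final name only ever points at a fully written file.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A process that waits for `final_path` to exist can then read it as soon as it appears. The rename is atomic on local POSIX filesystems; behaviour on distributed filesystems varies and would need checking.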

22 Aug, 2018 1 commit

Fix convergence issues for MLPerf. (#5161) · 64710c05

Reed authored Aug 22, 2018

* Fix convergence issues for MLPerf.

Thank you to @robieta for helping me find these issues, and for providing an algorithm for the `get_hit_rate_and_ndcg_mlperf` function.

This change causes every forked process to set a new seed, so that forked processes do not generate the same set of random numbers. This improves evaluation hit rates.

Additionally, it adds a flag, --ml_perf, that makes further changes so that the evaluation hit rate can match the MLPerf reference implementation.

I ran 4 times with --ml_perf and 4 times without. Without --ml_perf, the highest hit rates achieved by each run were 0.6278, 0.6287, 0.6289, and 0.6241. With --ml_perf, the highest hit rates were 0.6353, 0.6356, 0.6367, and 0.6353.

* fix lint error

* Fix failing test

* Address @robieta's feedback

* Address more feedback
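
The seeding fix described in this commit can be illustrated with a small sketch. A forked child starts with a copy of the parent's memory, so a generator that is not reseeded after the fork (NumPy's legacy global generator is a well-known case) produces the same numbers in every child. To stay dependency-free, the sketch uses the standard-library `random` module and simulates the inherited state by restoring a saved one instead of forking. All function names are hypothetical.

```python
import os
import random


def fresh_seed():
    """Build a seed from OS entropy mixed with the process id."""
    return int.from_bytes(os.urandom(8), "little") ^ os.getpid()


def reseed_worker():
    """Reseed this process's global RNG; meant to run once at worker start-up."""
    seed = fresh_seed()
    random.seed(seed)
    return seed


def simulate_children(n_children, reseed):
    """Stand-in for forking: every 'child' starts from the same saved RNG state."""
    inherited = random.getstate()
    draws = []
    for _ in range(n_children):
        random.setstate(inherited)  # what a fork would copy from the parent
        if reseed:
            reseed_worker()
        draws.append(random.random())
    return draws
```

Without reseeding, every simulated child draws the same value; with reseeding, the draws differ. In a real worker pool, a function like `reseed_worker` could be passed as the `initializer` argument of `multiprocessing.Pool`.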

02 Aug, 2018 1 commit

Fix bug where data_async_generation.py would freeze. (#4989) · 58037d2c

Reed authored Aug 02, 2018

The data_async_generation.py process would print to stderr, but the main process would redirect its stderr to a pipe. The main process never read from the pipe, so when the pipe was full, data_async_generation.py would stall on a write to stderr. This change makes data_async_generation.py not write to stdout/stderr.
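
The stall can also be avoided from the parent side by never handing the child a pipe that nobody reads. The sketch below is that parent-side variant, not the change this commit describes (which silences the child). `run_quiet_child` is a hypothetical helper.

```python
import subprocess
import sys


def run_quiet_child(code):
    """Run `python -c code` with its output discarded instead of piped."""
    # With stdout/stderr=subprocess.PIPE and no reader, the child blocks once
    # the OS pipe buffer fills. DEVNULL never fills, so writes cannot stall.
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.wait(timeout=60)
```

If the output is needed for debugging, redirecting it to a log file works the same way, since a file does not fill up like a pipe buffer does.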

30 Jul, 2018 1 commit

NCF pipeline refactor (take 2) and initial TPU port. (#4935) · 6518c1c7

Taylor Robie authored Jul 30, 2018

* intermediate commit

* ncf now working

* reorder pipeline

* allow batched decode for file backed dataset

* fix bug

* more tweaks

* parallelize false negative generation

* shared pool hack

* workers ignore sigint

* intermediate commit

* simplify buffer backed dataset creation to fixed length record approach only. (more cleanup needed)

* more tweaks

* simplify pipeline

* fix misplaced cleanup() calls. (validation works!)

* more tweaks

* sixify memoryview usage

* more sixification

* fix bug

* add future imports

* break up training input pipeline

* more pipeline tuning

* first pass at moving negative generation to async

* refactor async pipeline to use files instead of ipc

* refactor async pipeline

* move expansion and concatenation from reduce worker to generation workers

* abandon complete async due to interactions with the tensorflow threadpool

* cleanup

* remove per...
