Commits · fefe47ee1f557ed13fc2fbcd7d5e0b6c11e5121b · ModelZoo / ResNet50_tensorflow

07 Jan, 2019 15 commits

remove match_mlperf from expected cache keys · fefe47ee
Taylor Robie authored Jan 07, 2019

fefe47ee
Revert "remove mock to debug kokoro failures" · e0f26727
Taylor Robie authored Jan 07, 2019
```
This reverts commit 63f5827d.
```
e0f26727
remove mock to debug kokoro failures · 63f5827d
Taylor Robie authored Jan 07, 2019

63f5827d
fix lint errors · 0fbc71fc
Taylor Robie authored Jan 07, 2019

0fbc71fc
address more PR comments · 6726c5e0
Taylor Robie authored Jan 07, 2019

6726c5e0
address PR comments · 1bb074b0
Taylor Robie authored Jan 07, 2019

1bb074b0
add unbuffer to run.sh as tee is causing issues · 444f5993
Taylor Robie authored Dec 27, 2018

444f5993
skip bisection when it is not needed · 4cdea1cc
Taylor Robie authored Dec 27, 2018

4cdea1cc
remove references to hash pipeline · d569b531
Taylor Robie authored Dec 27, 2018

d569b531

Add bisection based producer for increased scalability, enable fully... · 4fb325da

Taylor Robie authored Dec 27, 2018

Add bisection based producer for increased scalability, enable fully deterministic data production, and use the materialized and bisection producer to check each other (via expected output md5s)

4fb325da
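
The commit message above does not say how the bisection producer works, so the following is only a minimal sketch of one plausible reading: picking a user's k-th negative item by binary search over the sorted positives instead of materializing the full negative list, then comparing MD5 digests of both samplers' seeded output. All function names and the toy data are hypothetical and are not taken from the repository.

```python
import bisect
import hashlib
import random


def negatives_materialized(positives, num_items):
    """Baseline: explicitly list every item the user has NOT interacted with."""
    seen = set(positives)
    return [item for item in range(num_items) if item not in seen]


def kth_negative_bisection(sorted_positives, k):
    """Return the k-th (0-indexed) negative item without building the list.

    gaps[i] counts the negatives smaller than sorted_positives[i]; binary
    search over gaps gives the number of positives preceding the k-th negative.
    (A real producer would precompute gaps once per user; assumption.)
    """
    gaps = [p - i for i, p in enumerate(sorted_positives)]
    return k + bisect.bisect_right(gaps, k)


def sample_digest(sampler, num_negatives, num_samples, seed):
    """MD5 of a seeded sequence of sampled negatives, used for cross-checking."""
    rng = random.Random(seed)
    picks = [sampler(rng.randrange(num_negatives)) for _ in range(num_samples)]
    return hashlib.md5(",".join(map(str, picks)).encode("utf-8")).hexdigest()


# Hypothetical toy data: one user, 7 items, 3 positives.
positives = [1, 3, 4]
num_items = 7
materialized = negatives_materialized(positives, num_items)

digest_a = sample_digest(lambda k: materialized[k], len(materialized), 100, seed=0)
digest_b = sample_digest(lambda k: kth_negative_bisection(positives, k),
                         len(materialized), 100, seed=0)
```

With the same seed, both samplers should yield the same digest. The bisection variant needs only the sorted positives per user, which is presumably where the scalability gain comes from.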

change eval_batch_size flag from a string to an int · 1048ffd5
Taylor Robie authored Dec 26, 2018

1048ffd5
address PR comments · ec0d43ba
Taylor Robie authored Dec 21, 2018

ec0d43ba
delint · c556dad9
Taylor Robie authored Dec 21, 2018

c556dad9
remove 'deterministic' · 9d42f797
Taylor Robie authored Dec 21, 2018

9d42f797

rough pass at carving out existing NCF pipeline · c5ff4ec7

Taylor Robie authored Nov 18, 2018

2nd half of rough replacement pass

fix dataset map functions

reduce bias in sample selection

cache pandas work on a daily basis

cleanup and fix batch check for multi gpu

multi device fix

fix treatment of eval data padding

print data producer

replace epoch overlap with padding and masking

move type and shape info into the producer class and update run.sh with larger batch size hyperparams

remove xla for multi GPU

more cleanup

remove model runner altogether

bug fixes

address subtle pipeline hang and improve producer __repr__

fix crash

fix assert

use popen_helper to create pools

add StreamingFilesDataset and abstract data storage to a separate class

bug fix

fix wait bug and add manual stack trace print

more bug fixes and refactor valid point mask to work with TPU sharding

misc bug fixes and adjust dtypes

address crash from decoding bools

fix remaining dtypes and change record writer pattern since it does not append

fix synthetic data

use TPUStrategy instead of TPUEstimator

minor tweaks around moving to TPUStrategy

cleanup some old code

delint and simplify permutation generation

remove low level tf layer definition, use single table with slice for keras, and misc fixes

missed minor point on removing tf layer definition

fix several bugs from recombining layer definitions

delint and add docstrings

Update ncf_test.py. Section for identical inputs and different outputs was removed.

update data test to run against the new producer class

c5ff4ec7

20 Dec, 2018 1 commit
- Avoid using tf.contrib.data as it's not tf2-safe (#5755) · cc0ad1cb
  Alexandre Passos authored Dec 19, 2018
  
  cc0ad1cb
07 Nov, 2018 1 commit
- Fix PREPROC_HP_NUM_EVAL tag for MLPerf. (#5717) · d7ce21fa
  Reed authored Nov 07, 2018
  ```
  This tag should match EVAL_HP_NUM_NEG.
  ```
  d7ce21fa
03 Nov, 2018 1 commit

Have async process end when all data is written. (#5652) · 424fe9f6

Reed authored Nov 02, 2018

I've noticed sometimes the async process's pool processes do not die when ncf_main.py ends and kills the async process. This commit fixes the issue.

424fe9f6

01 Nov, 2018 1 commit
- Add --use_while_loop option. (#5653) · 826eea75
  Reed authored Nov 01, 2018
  
  826eea75
30 Oct, 2018 3 commits
- bring NCF to l2 logging compliance (#5642) · 82e783e3
  Taylor Robie authored Oct 30, 2018
  
  82e783e3
- Keras-ify NCF TPU embedding lookup (#5641) · 8a15a4df
  Taylor Robie authored Oct 30, 2018
  ```
  * Keras-ify TPU embedding lookup

  * delint

  * pull get_variable() out of keras lambda

  * delint

  * move get_variable under variable scope
  ```
  8a15a4df
- Merges TPU-TC optimizations into HEAD. (#5635) · b8318fd3
  Tayo Oguntebi authored Oct 29, 2018
  ```
  * Merges TPU-TC optimizations into HEAD.

  * Split a line that went over 80 from a tab.

  * Remove trailing whitespace.
  ```
  b8318fd3
29 Oct, 2018 1 commit
- Add option to not use estimator. (#5623) · 0c0860ed
  Reed authored Oct 29, 2018
  ```
  The option is --nouse_estimator
  ```
  0c0860ed
26 Oct, 2018 1 commit

Split --ml_perf into two flags. (#5615) · 4298c3a3

Reed authored Oct 26, 2018

--ml_perf now just changes the model to make it MLPerf compliant. --output_ml_perf_compliance_logging adds the MLPerf compliance logs.

4298c3a3

25 Oct, 2018 2 commits

prevent async process from writing alive file until the main process has... · 2644707c
Taylor Robie authored Oct 25, 2018
```
prevent async process from writing alive file until the main process has created the cache root (#5614)
```
2644707c

Fix crash when --ml_perf flag is not specified. (#5610) · 48a4b443

Reed authored Oct 25, 2018

The error message was:

absl.flags._exceptions.IllegalFlagValueError: flag --ml_perf=None: ('Non-boolean argument to boolean flag', 'None')

48a4b443

24 Oct, 2018 1 commit

Add logging calls to NCF (#5576) · 780f5265

Taylor Robie authored Oct 24, 2018

* first pass at __getattr__ abuse logger

* first pass at adding tags to NCF

* minor formatting updates

* fix tag name

* convert metrics to python floats

* getting closer...

* direct mlperf logs to a file

* small tweaks and add stitching

* update tags

* fix tag and add a sudo call

* tweak format of run.sh

* delint

* use distribution strategies for evaluation

* address PR comments

* delint and fix test

* adjust flag validation for xla

* add prefix to distinguish log stitching

* fix index bug

* fix clear cache for root user

* dockerize cache drop

* TIL some regex magic

780f5265

20 Oct, 2018 1 commit
- Add XLA support to NCF (#5572) · f2b702a0
  Reed authored Oct 19, 2018
  
  f2b702a0
19 Oct, 2018 1 commit
- fix error when last shard is not assigned a batch (#5569) · bf298439
  Taylor Robie authored Oct 18, 2018
  
  bf298439
18 Oct, 2018 2 commits

Reorder NCF data pipeline (#5536) · 19d4eaaf

Taylor Robie authored Oct 18, 2018

* intermediate commit

finish replacing spillover with resampled padding

intermediate commit

* resolve merge conflict

* intermediate commit

* further consolidate the data pipeline

* complete first pass at data pipeline refactor

* remove some leftover code

* fix test

* remove resampling, and move train padding logic into neumf.py

* small tweaks

* fix weight bug

* address PR comments

* fix dict zip. (Reed led me astray)

* delint

* make data test deterministic and delint

* Reed didn't lead me astray. I just can't read.

* more delinting

* even more delinting

* use resampling for last batch padding

* pad last batch with unique data

* Revert "pad last batch with unique data"

This reverts commit cbdf46efcd5c7907038a24105b88d38e7f1d6da2.

* move padded batch to the beginning

* delint

* fix step check for synthetic data

19d4eaaf

Delint. · 3ec25e5d
Shawn Wang authored Oct 17, 2018

3ec25e5d

17 Oct, 2018 2 commits
- Fix a few imports. · f9742f43
  Shawn Wang authored Oct 17, 2018
  
  f9742f43
- Refactor neumf_model.py to support users who just need top_k and ndcg tensors. · 91000bc5
  Shawn Wang authored Oct 17, 2018
  
  91000bc5
14 Oct, 2018 1 commit
- Make flagfile sharing robust to distributed filesystems and multi-worker setups. (#5521) · 91b2debd
  Taylor Robie authored Oct 14, 2018
  ```
  * move flagfile into the cache_dir

  * remove duplicate code

  * delint
  ```
  91b2debd
13 Oct, 2018 1 commit

Replace multiprocess pool with popen_helper.get_pool() in data_preprocessing. (#5512) · 0c5c3a77

shizhiw authored Oct 12, 2018

* Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.

* Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.

* Replace multiprocess pool with popen_helper.get_pool() in data_preprocessing.

0c5c3a77

11 Oct, 2018 5 commits
- Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py. (#5506) · b88da6ee
  shizhiw authored Oct 11, 2018
  ```
  * Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.

  * Use data_dir instead of flags.FLAGS.data_dir in data_preprocessing.py.
  ```
  b88da6ee
- Add comments, exit async process after waiting for flagfile for too long and... · 1980a0da
  Shawn Wang authored Oct 11, 2018
  ```
  Add comments, exit async process after waiting for flagfile for too long and make directory for data_dir in case it does not exist.
  ```
  1980a0da
- Use flagfile to pass flags to data async generation process: small fix. · 5d497296
  Shawn Wang authored Oct 11, 2018
  
  5d497296
- Use flagfile to pass flags to data async generation process. · c88fcb2b
  Shawn Wang authored Oct 11, 2018
  
  c88fcb2b
- Added option to use_subprocess or not in ncf_main.py. · d4ac494f
  Shawn Wang authored Oct 11, 2018
  
  d4ac494f