Commits · 4c11b84b1c7360aecff7c4a679d7e05076ffc19d · ModelZoo / ResNet50_tensorflow

28 Mar, 2019 1 commit

Added benchmark test and convergence test for the NCF model (#6318) · 4c11b84b

Shining Sun authored Mar 28, 2019

* initial commit

* bug fix

* Move build_stats from common to keras main, because it is only applicable in keras

* remove trailing blank line

* add test for synth data

* add kwargs to init

* add kwargs to function invocation

* correctly pass kwargs

* debug

* debug

* debug

* fix super init

* bug fix

* fix local_flags

* fix import

* bug fix

* fix log_steps flag

* bug fix

* bug fix: add missing return value

* resolve double-defined flags

* lint fix

* move log_steps flag to benchmark flag

* fix lint

* lint fix

* lint fix

* try flag core default values

* bug fix

* bug fix

* bug fix

* debug

* debug

* remove debug prints

* rename benchmark methods

* flag bug fix for synth benchmark

4c11b84b

13 Mar, 2019 1 commit

Fix ncf test for keras (#6355) · dadc4a62

Shining Sun authored Mar 13, 2019

* Fix ncf test for keras

* add a todo for batch_size and eval_batch_size for ncf keras

* lint fix

* fix typos

* Lint fix

* fix lint

* resolve pr comment

* resolve pr comment

dadc4a62

02 Mar, 2019 1 commit

fix resnet breakage and add keras end-to-end tests (#6295) · 8367cf6d

Taylor Robie authored Mar 02, 2019

* fix resnet breakage and add keras end-to-end tests

* delint

* address PR comments

8367cf6d

01 Mar, 2019 1 commit

Keras-fy NCF Model (#6092) · 048e5bff

Shining Sun authored Mar 01, 2019

* tmp commit

* tmp commit

* first attempt (without eval)

* Bug fixes

* bug fixes

* training done

* Loss NAN, no eval

* Loss weight problem solved

* resolve the NAN loss problem

* Problem solved. Clean up needed

* Added a todo

* Remove debug prints

* Extract get_optimizer to ncf_common

* Move metrics computation back to neumf; use DS.scope api

* Extract DS.scope code to utils

* lint fixes

* Move obtaining DS above producer.start to avoid race condition

* move pt 1

* move pt 2

* Update the run script

* Wrap keras_model related code into functions

* Update the doc for softmax_logitfy and change the method name

* Resolve PR comments

* working version with: eager, DS, batch and no masks

* Remove git conflict indicator

* move reshape to neumf_model

* working version, not converge

* converged

* fix a test

* more lint fix

* more lint fix

* more lint fixes

* more lint fix

* Removed unused imports

* fix test

* dummy commit for kicking off checks

* fix lint issue

* dummy input to kick off checks

* dummy input to kick off checks

* add collective to dist strat

* addressed review comments

* add a doc string

048e5bff

07 Jan, 2019 2 commits

address PR comments · 1bb074b0

Taylor Robie authored Jan 07, 2019

1bb074b0

rough pass at carving out existing NCF pipeline · c5ff4ec7

Taylor Robie authored Nov 18, 2018

2nd half of rough replacement pass

fix dataset map functions

reduce bias in sample selection

cache pandas work on a daily basis

cleanup and fix batch check for multi gpu

multi device fix

fix treatment of eval data padding

print data producer

replace epoch overlap with padding and masking

move type and shape info into the producer class and update run.sh with larger batch size hyperparams

remove xla for multi GPU

more cleanup

remove model runner altogether

bug fixes

address subtle pipeline hang and improve producer __repr__

fix crash

fix assert

use popen_helper to create pools

add StreamingFilesDataset and abstract data storage to a separate class

bug fix

fix wait bug and add manual stack trace print

more bug fixes and refactor valid point mask to work with TPU sharding

misc bug fixes and adjust dtypes

address crash from decoding bools

fix remaining dtypes and change record writer pattern since it does not append

fix synthetic data

use TPUStrategy instead of TPUEstimator

minor tweaks around moving to TPUStrategy

cleanup some old code

delint and simplify permutation generation

remove low level tf layer definition, use single table with slice for keras, and misc fixes

missed minor point on removing tf layer definition

fix several bugs from recombining layer definitions

delint and add docstrings

Update ncf_test.py. Section for identical inputs and different outputs was removed.

update data test to run against the new producer class

c5ff4ec7

03 Nov, 2018 1 commit

Have async process end when all data is written. (#5652) · 424fe9f6

Reed authored Nov 02, 2018

I've noticed sometimes the async process's pool processes do not die when ncf_main.py ends and kills the async process. This commit fixes the issue.

424fe9f6

01 Nov, 2018 1 commit

Add --use_while_loop option. (#5653) · 826eea75

Reed authored Nov 01, 2018

826eea75

29 Oct, 2018 1 commit

Add option to not use estimator. (#5623) · 0c0860ed

Reed authored Oct 29, 2018

The option is --nouse_estimator

0c0860ed

26 Oct, 2018 1 commit

Split --ml_perf into two flags. (#5615) · 4298c3a3

Reed authored Oct 26, 2018

--ml_perf now just changes the model to make it MLPerf compliant. --output_ml_perf_compliance_logging adds the MLPerf compliance logs.

4298c3a3

03 Oct, 2018 1 commit

Move evaluation to .evaluate() (#5413) · c494582f

Taylor Robie authored Oct 02, 2018

* move evaluation from numpy to tensorflow

fix syntax error

don't use sigmoid to convert logits. there is too much precision loss.

WIP: add logit metrics

continue refactor of NCF evaluation

fix syntax error

fix bugs in eval loss calculation

fix eval loss reweighting

remove numpy based metric calculations

fix logging hooks

fix sigmoid to softmax bug

fix comment

catch rare PIPE error and address some PR comments

* fix metric test and address PR comments

* delint and fix python2

* fix test and address PR comments

* extend eval to TPUs

c494582f
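
The note in c494582f about not using sigmoid to convert logits can be illustrated with a small sketch. This is not the code from the commit; it shows one way such precision loss can appear, assuming the scores are only used for ranking. The `sigmoid` helper and the sample logits are made up for illustration. Python floats are float64 here; float32 saturates at smaller logits.

```python
import math


def sigmoid(x: float) -> float:
    """Plain logistic function (hypothetical helper, not from the repository)."""
    return 1.0 / (1.0 + math.exp(-x))


# Three distinct logits, already sorted in increasing order.
logits = [30.0, 40.0, 50.0]
probs = [sigmoid(x) for x in logits]

# The logits are all different, so ranking by logit is unambiguous.
distinct_logits = len(set(logits))

# After the sigmoid, the two largest values both round to exactly 1.0,
# so the ordering between them can no longer be recovered.
distinct_probs = len(set(probs))

print(distinct_logits, distinct_probs)
print(probs[1] == probs[2])
```

Because sigmoid is monotonic, ranking on the raw logits gives the same order wherever the probabilities are still distinguishable, and it avoids these ties.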

22 Aug, 2018 1 commit

Fix convergence issues for MLPerf. (#5161) · 64710c05

Reed authored Aug 22, 2018

* Fix convergence issues for MLPerf.

Thank you to @robieta for helping me find these issues, and for providing an algorithm for the `get_hit_rate_and_ndcg_mlperf` function.

This change causes every forked process to set a new seed, so that forked processes do not generate the same set of random numbers. This improves evaluation hit rates.

Additionally, it adds a flag, --ml_perf, that makes further changes so that the evaluation hit rate can match the MLPerf reference implementation.

I ran 4 times with --ml_perf and 4 times without. Without --ml_perf, the highest hit rates achieved by each run were 0.6278, 0.6287, 0.6289, and 0.6241. With --ml_perf, the highest hit rates were 0.6353, 0.6356, 0.6367, and 0.6353.

* fix lint error

* Fix failing test

* Address @robieta's feedback

* Address more feedback

64710c05
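
The reseeding described in 64710c05 can be sketched as follows. This is a simulation, not the repository's code: instead of actually forking, each "child" starts from a copy of the parent's RNG state, which is what a forked process inherits. The function name `simulated_child_draws` and the use of `os.urandom` as the seed source are assumptions made for illustration.

```python
import os
import random


def simulated_child_draws(parent_state, reseed):
    """Return the first few draws of a simulated forked child.

    A forked process starts with a copy of the parent's RNG state;
    that copy is modelled here with setstate() (illustrative only).
    """
    rng = random.Random()
    rng.setstate(parent_state)
    if reseed:
        # Assumed seed source: fresh OS entropy for every child.
        rng.seed(int.from_bytes(os.urandom(8), "little"))
    return [rng.randrange(10**9) for _ in range(5)]


parent = random.Random(0)
state = parent.getstate()

# Without reseeding, every child repeats the same sequence.
without_reseed = [simulated_child_draws(state, reseed=False) for _ in range(3)]

# With reseeding, each child draws from its own sequence.
with_reseed = [simulated_child_draws(state, reseed=True) for _ in range(3)]

print(without_reseed[0] == without_reseed[1] == without_reseed[2])
print(len({tuple(draws) for draws in with_reseed}))
```

In a real worker pool the reseed step would run once at the start of each forked process, for example in a pool initializer.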