Replace pipeline in NCF (#5786)
* rough pass at carving out existing NCF pipeline
  2nd half of rough replacement pass
  fix dataset map functions
  reduce bias in sample selection
  cache pandas work on a daily basis
  cleanup and fix batch check for multi gpu
  multi device fix
  fix treatment of eval data padding
  print data producer
  replace epoch overlap with padding and masking
  move type and shape info into the producer class and update run.sh with larger batch size hyperparams
  remove xla for multi GPU
  more cleanup
  remove model runner altogether
  bug fixes
  address subtle pipeline hang and improve producer __repr__
  fix crash
  fix assert
  use popen_helper to create pools
  add StreamingFilesDataset and abstract data storage to a separate class
  bug fix
  fix wait bug and add manual stack trace print
  more bug fixes and refactor valid point mask to work with TPU sharding
  misc bug fixes and adjust dtypes
  address crash from decoding bools
  fix remaining dtypes and change record writer pattern since it does not append
  fix synthetic data
  use TPUStrategy instead of TPUEstimator
  minor tweaks around moving to TPUStrategy
  cleanup some old code
  delint and simplify permutation generation
  remove low level tf layer definition, use single table with slice for keras, and misc fixes
  missed minor point on removing tf layer definition
  fix several bugs from recombining layer definitions
  delint and add docstrings
  Update ncf_test.py. Section for identical inputs and different outputs was removed.
  update data test to run against the new producer class
* remove 'deterministic'
* delint
* address PR comments
* change eval_batch_size flag from a string to an int
* Add bisection based producer for increased scalability, enable fully deterministic data production, and use the materialized and bisection producer to check each other (via expected output md5's)
* remove references to hash pipeline
* skip bisection when it is not needed
* add unbuffer to run.sh as tee is causing issues
* address PR comments
* address more PR comments
* fix lint errors
* trim lines in resnet keras
* remove mock to debug kokoro failures
* Revert "remove mock to debug kokoro failures"
  This reverts commit 63f5827d.
* remove match_mlperf from expected cache keys
* fix test now that cache construction no longer uses match_mlperf
* disable tests to debug test failure
* disable more tests
* completely disable data_test
* restore data test
* add versions to requirements.txt
* update call to TPUStrategy