Unverified Commit 56cbd1f2 authored by Taylor Robie's avatar Taylor Robie Committed by GitHub
Browse files

Replace pipeline in NCF (#5786)

* rough pass at carving out existing NCF pipeline

2nd half of rough replacement pass

fix dataset map functions

reduce bias in sample selection

cache pandas work on a daily basis

cleanup and fix batch check for multi gpu

multi device fix

fix treatment of eval data padding

print data producer

replace epoch overlap with padding and masking

move type and shape info into the producer class and update run.sh with larger batch size hyperparams

remove xla for multi GPU

more cleanup

remove model runner altogether

bug fixes

address subtle pipeline hang and improve producer __repr__

fix crash

fix assert

use popen_helper to create pools

add StreamingFilesDataset and abstract data storage to a separate class

bug fix

fix wait bug and add manual stack trace print

more bug fixes and refactor valid point mask to work with TPU sharding

misc bug fixes and adjust dtypes

address crash from decoding bools

fix remaining dtypes and change record writer pattern since it does not append

fix synthetic data

use TPUStrategy instead of TPUEstimator

minor tweaks around moving to TPUStrategy

cleanup some old code

delint and simplify permutation generation

remove low level tf layer definition, use single table with slice for keras, and misc fixes

missed minor point on removing tf layer definition

fix several bugs from recombining layer definitions

delint and add docstrings

Update ncf_test.py. Section for identical inputs and different outputs was removed.

update data test to run against the new producer class

* remove 'deterministic'

* delint

* address PR comments

* change eval_batch_size flag from a string to an int

* Add bisection based producer for increased scalability, enable fully deterministic data production, and use the materialized and bisection producer to check each other (via expected output md5's)

* remove references to hash pipeline

* skip bisection when it is not needed

* add unbuffer to run.sh as tee is causing issues

* address PR comments

* address more PR comments

* fix lint errors

* trim lines in resnet keras

* remove mock to debug kokoro failures

* Revert "remove mock to debug kokoro failures"

This reverts commit 63f5827d.

* remove match_mlperf from expected cache keys

* fix test now that cache construction no longer uses match_mlperf

* disable tests to debug test failure

* disable more tests

* completely disable data_test

* restore data test

* add versions to requirements.txt

* update call to TPUStrategy
parents 2c4dc0c0 7021ac1c
......@@ -14,36 +14,30 @@
# ==============================================================================
"""Central location for NCF specific values."""
import os
import time
import sys
import numpy as np
from official.datasets import movielens
# ==============================================================================
# == Main Thread Data Processing ===============================================
# ==============================================================================
class Paths(object):
  """Holds the filesystem layout used to cache NCF training artifacts."""

  def __init__(self, data_dir, cache_id=None):
    # Fall back to a timestamp so that concurrent runs (or reruns) get
    # distinct cache directories when no explicit id is provided.
    self.cache_id = cache_id or int(time.time())
    self.data_dir = data_dir
    cache_name = "{}_ncf_recommendation_cache".format(self.cache_id)
    self.cache_root = os.path.join(self.data_dir, cache_name)
    self.train_shard_subdir = os.path.join(
        self.cache_root, "raw_training_shards")
    self.train_shard_template = os.path.join(
        self.train_shard_subdir, "positive_shard_{}.pickle")
    self.train_epoch_dir = os.path.join(self.cache_root, "training_epochs")
    self.eval_data_subdir = os.path.join(self.cache_root, "eval_data")
    # Sentinel file used by the generation subprocess to signal liveness.
    self.subproc_alive = os.path.join(self.cache_root, "subproc.alive")
# Keys for data shards
TRAIN_USER_KEY = "train_{}".format(movielens.USER_COLUMN)
TRAIN_ITEM_KEY = "train_{}".format(movielens.ITEM_COLUMN)
TRAIN_LABEL_KEY = "train_labels"
# Index below which points in a padded training batch are real; the boolean
# mask derived from it is keyed by VALID_POINT_MASK.
MASK_START_INDEX = "mask_start_index"
VALID_POINT_MASK = "valid_point_mask"
EVAL_USER_KEY = "eval_{}".format(movielens.USER_COLUMN)
EVAL_ITEM_KEY = "eval_{}".format(movielens.ITEM_COLUMN)

USER_MAP = "user_map"
ITEM_MAP = "item_map"

# Approximate number of positive examples per raw training shard.
APPROX_PTS_PER_TRAIN_SHARD = 128000

# Top-level keys within a pickled data shard.
TRAIN_KEY = "train"
EVAL_KEY = "eval"

# dtypes used when encoding user and item ids as raw bytes.
USER_DTYPE = np.int32
ITEM_DTYPE = np.int32

# In both datasets, each user has at least 20 ratings.
MIN_NUM_RATINGS = 20
......@@ -62,21 +56,24 @@ DUPLICATE_MASK = "duplicate_mask"
HR_METRIC_NAME = "HR_METRIC"
NDCG_METRIC_NAME = "NDCG_METRIC"

# Trying to load a cache created in py2 when running in py3 will cause an
# error due to differences in unicode handling.
RAW_CACHE_FILE = "raw_data_cache_py{}.pickle".format(sys.version_info[0])
CACHE_INVALIDATION_SEC = 3600 * 24

# ==============================================================================
# == Data Generation ===========================================================
# ==============================================================================
CYCLES_TO_BUFFER = 3  # The number of train cycles worth of data to "run ahead"
                      # of the main training loop.

# Flagfile / readiness sentinels. "*.temp" variants are written first and then
# atomically renamed so readers never observe a partially written file.
FLAGFILE_TEMP = "flagfile.temp"
FLAGFILE = "flagfile"
READY_FILE_TEMP = "ready.json.temp"
READY_FILE = "ready.json"
TRAIN_RECORD_TEMPLATE = "train_{}.tfrecords"
EVAL_RECORD_TEMPLATE = "eval_{}.tfrecords"

# Number of batches to run per epoch when using synthetic data. At high batch
# sizes, we run for more batches than with real data, which is good since
# running more batches reduces noise when measuring the average batches/second.
SYNTHETIC_BATCHES_PER_EPOCH = 2000

TIMEOUT_SECONDS = 3600 * 2  # If the train loop goes more than two hours without
                            # consuming an epoch of data, this is a good
                            # indicator that the main thread is dead and the
                            # subprocess is orphaned.

# Only used when StreamingFilesDataset is used.
NUM_FILE_SHARDS = 16
TRAIN_FOLDER_TEMPLATE = "training_cycle_{}"
EVAL_FOLDER = "eval_data"
SHARD_TEMPLATE = "shard_{}.tfrecords"
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Asynchronously generate TFRecords files for NCF."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import atexit
import contextlib
import datetime
import gc
import multiprocessing
import json
import os
import pickle
import signal
import sys
import tempfile
import time
import timeit
import traceback
import typing
import numpy as np
import tensorflow as tf
from absl import app as absl_app
from absl import flags
from official.datasets import movielens
from official.recommendation import constants as rconst
from official.recommendation import stat_utils
from official.recommendation import popen_helper
from official.utils.logs import mlperf_helper
# Handle to the redirected log file. It is opened in main() when
# --redirect_logs is set; otherwise it stays None and log_msg() prints to
# stdout (print(..., file=None)).
_log_file = None
def log_msg(msg):
  """Write a message via tf.logging or the (possibly redirected) log file."""
  if flags.FLAGS.use_tf_logging:
    tf.logging.info(msg)
    return

  # When logs are redirected, prefix each line with an ISO-8601 timestamp.
  if flags.FLAGS.redirect_logs:
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    line = "[{}] {}".format(now, msg)
  else:
    line = msg
  print(line, file=_log_file)
  if _log_file:
    _log_file.flush()
def get_cycle_folder_name(i):
  """Return the zero-padded folder name for training cycle `i`."""
  return "cycle_" + str(i).zfill(5)
def _process_shard(args):
  # type: ((str, int, int, int, bool, bool)) -> (np.ndarray, np.ndarray, np.ndarray)
  """Read a shard of training data and return training vectors.

  This function is invoked via `pool.imap`/`imap_unordered`, which passes a
  single argument, so all parameters are packed into one tuple.

  Args:
    shard_path: The filepath of the positive instance training shard.
    num_items: The cardinality of the item set.
    num_neg: The number of negatives to generate per positive example.
    seed: Random seed to be used when generating negatives.
    is_training: Generate training (True) or eval (False) data.
    match_mlperf: Match the MLPerf reference behavior.

  Returns:
    A tuple of (users, items, labels) NumPy vectors of equal length containing
    the positives from the shard interleaved with generated negatives.
  """
  shard_path, num_items, num_neg, seed, is_training, match_mlperf = args
  np.random.seed(seed)

  # The choice to store the training shards in files rather than in memory
  # is motivated by the fact that multiprocessing serializes arguments,
  # transmits them to map workers, and then deserializes them. By storing the
  # training shards in files, the serialization work only needs to be done once.
  #
  # A similar effect could be achieved by simply holding pickled bytes in
  # memory, however the processing is not I/O bound and is therefore
  # unnecessary.
  with tf.gfile.Open(shard_path, "rb") as f:
    shard = pickle.load(f)

  users = shard[rconst.TRAIN_KEY][movielens.USER_COLUMN]
  items = shard[rconst.TRAIN_KEY][movielens.ITEM_COLUMN]

  if not is_training:
    # For eval, there is one positive which was held out from the training set.
    test_positive_dict = dict(zip(
        shard[rconst.EVAL_KEY][movielens.USER_COLUMN],
        shard[rconst.EVAL_KEY][movielens.ITEM_COLUMN]))

  # Compute the start index of each user's contiguous run of rows. This
  # requires `users` to be grouped by user id (asserted inside the loop).
  delta = users[1:] - users[:-1]
  boundaries = ([0] + (np.argwhere(delta)[:, 0] + 1).tolist() +
                [users.shape[0]])

  user_blocks = []
  item_blocks = []
  label_blocks = []
  for i in range(len(boundaries) - 1):
    assert len(set(users[boundaries[i]:boundaries[i+1]])) == 1
    current_user = users[boundaries[i]]

    positive_items = items[boundaries[i]:boundaries[i+1]]
    positive_set = set(positive_items)
    if positive_items.shape[0] != len(positive_set):
      raise ValueError("Duplicate entries detected.")

    if is_training:
      n_pos = len(positive_set)
      negatives = stat_utils.sample_with_exclusion(
          num_items, positive_set, n_pos * num_neg, replacement=True)
    else:
      if not match_mlperf:
        # The mlperf reference allows the holdout item to appear as a negative.
        # Including it in the positive set makes the eval more stringent,
        # because an appearance of the test item would be removed by
        # deduplication rules. (Effectively resulting in a minute reduction of
        # NUM_EVAL_NEGATIVES)
        positive_set.add(test_positive_dict[current_user])
      negatives = stat_utils.sample_with_exclusion(
          num_items, positive_set, num_neg, replacement=match_mlperf)
      # For eval, the only "positive" emitted is the held-out item.
      positive_set = [test_positive_dict[current_user]]
      n_pos = len(positive_set)
      assert n_pos == 1

    user_blocks.append(current_user * np.ones(
        (n_pos * (1 + num_neg),), dtype=np.int32))
    item_blocks.append(
        np.array(list(positive_set) + negatives, dtype=np.uint16))
    # Positives are placed first within each user's block; labels mark them.
    labels_for_user = np.zeros((n_pos * (1 + num_neg),), dtype=np.int8)
    labels_for_user[:n_pos] = 1
    label_blocks.append(labels_for_user)

  users_out = np.concatenate(user_blocks)
  items_out = np.concatenate(item_blocks)
  labels_out = np.concatenate(label_blocks)

  assert users_out.shape == items_out.shape == labels_out.shape
  return users_out, items_out, labels_out
def _construct_record(users, items, labels=None, dupe_mask=None):
  """Serialize a batch of NumPy arrays into a single tf.train.Example string.

  Args:
    users: NumPy array of user ids for the batch.
    items: NumPy array of item ids for the batch.
    labels: Optional NumPy array of labels (training records only).
    dupe_mask: Optional NumPy array duplicate mask (eval records only).

  Returns:
    The serialized tf.train.Example bytes.
  """
  def _bytes_feature(array):
    # Each array is stored as the raw bytes of its underlying buffer.
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[memoryview(array).tobytes()]))

  feature_dict = {
      movielens.USER_COLUMN: _bytes_feature(users),
      movielens.ITEM_COLUMN: _bytes_feature(items),
  }
  if labels is not None:
    feature_dict["labels"] = _bytes_feature(labels)
  if dupe_mask is not None:
    feature_dict[rconst.DUPLICATE_MASK] = _bytes_feature(dupe_mask)

  return tf.train.Example(
      features=tf.train.Features(feature=feature_dict)).SerializeToString()
def sigint_handler(signal, frame):  # pylint: disable=redefined-outer-name,unused-argument
  """SIGINT handler installed in pool workers (see init_worker); only logs.

  NOTE(review): the `signal` parameter shadows the `signal` module imported at
  the top of the file. Harmless here since the module is not used in the body,
  but worth renaming.
  """
  log_msg("Shutting down worker.")
def init_worker():
  # Pool initializer: replace the default SIGINT behavior (raising
  # KeyboardInterrupt) in each worker with the logging handler above.
  signal.signal(signal.SIGINT, sigint_handler)
def _construct_records(
    is_training,          # type: bool
    train_cycle,          # type: typing.Optional[int]
    num_workers,          # type: int
    cache_paths,          # type: rconst.Paths
    num_readers,          # type: int
    num_neg,              # type: int
    num_positives,        # type: int
    num_items,            # type: int
    epochs_per_cycle,     # type: int
    batch_size,           # type: int
    training_shards,      # type: typing.List[str]
    deterministic=False,  # type: bool
    match_mlperf=False    # type: bool
    ):
  """Generate false negatives and write TFRecords files.

  Args:
    is_training: Are training records (True) or eval records (False) created.
    train_cycle: Integer of which cycle the generated data is for.
    num_workers: Number of multiprocessing workers to use for negative
      generation.
    cache_paths: Paths object with information of where to write files.
    num_readers: The number of reader datasets in the input_fn. This number is
      approximate; fewer shards will be created if not all shards are assigned
      batches. This can occur due to discretization in the assignment process.
    num_neg: The number of false negatives per positive example.
    num_positives: The number of positive examples. This value is used
      to pre-allocate arrays while the imap is still running. (NumPy does not
      allow dynamic arrays.)
    num_items: The cardinality of the item set.
    epochs_per_cycle: The number of epochs worth of data to construct.
    batch_size: The expected batch size used during training. This is used
      to properly batch data when writing TFRecords.
    training_shards: The pickled positive examples from which to generate
      negatives.
    deterministic: Use ordered (imap) rather than unordered (imap_unordered)
      mapping over the worker pool so output is reproducible for a given seed.
    match_mlperf: Match the MLPerf reference behavior.
  """
  st = timeit.default_timer()
  if is_training:
    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.INPUT_STEP_TRAIN_NEG_GEN)
    mlperf_helper.ncf_print(
        key=mlperf_helper.TAGS.INPUT_HP_NUM_NEG, value=num_neg)

    # set inside _process_shard()
    mlperf_helper.ncf_print(
        key=mlperf_helper.TAGS.INPUT_HP_SAMPLE_TRAIN_REPLACEMENT, value=True)

  else:
    # Later logic assumes that all items for a given user are in the same batch.
    assert not batch_size % (rconst.NUM_EVAL_NEGATIVES + 1)
    assert num_neg == rconst.NUM_EVAL_NEGATIVES

    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.INPUT_STEP_EVAL_NEG_GEN)

    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.EVAL_HP_NUM_USERS,
                            value=num_positives)

  # Eval data is produced exactly once per cycle.
  assert epochs_per_cycle == 1 or is_training
  num_workers = min([num_workers, len(training_shards) * epochs_per_cycle])

  num_pts = num_positives * (1 + num_neg)

  # Equivalent to `int(ceil(num_pts / batch_size)) * batch_size`, but without
  # precision concerns
  num_pts_with_padding = (num_pts + batch_size - 1) // batch_size * batch_size
  num_padding = num_pts_with_padding - num_pts

  # We choose a different random seed for each process, so that the processes
  # will not all choose the same random numbers.
  process_seeds = [stat_utils.random_int32()
                   for _ in training_shards * epochs_per_cycle]
  map_args = [
      (shard, num_items, num_neg, process_seeds[i], is_training, match_mlperf)
      for i, shard in enumerate(training_shards * epochs_per_cycle)]

  with popen_helper.get_pool(num_workers, init_worker) as pool:
    map_fn = pool.imap if deterministic else pool.imap_unordered  # pylint: disable=no-member
    data_generator = map_fn(_process_shard, map_args)
    # Pre-allocated output vectors: [users, items, labels]. Users start at -1
    # so unfilled (padding) slots can be detected below.
    data = [
        np.zeros(shape=(num_pts_with_padding,), dtype=np.int32) - 1,
        np.zeros(shape=(num_pts_with_padding,), dtype=np.uint16),
        np.zeros(shape=(num_pts_with_padding,), dtype=np.int8),
    ]

    # Training data is shuffled. Evaluation data MUST not be shuffled.
    # Downstream processing depends on the fact that evaluation data for a given
    # user is grouped within a batch.
    if is_training:
      index_destinations = np.random.permutation(num_pts)
      mlperf_helper.ncf_print(key=mlperf_helper.TAGS.INPUT_ORDER)
    else:
      index_destinations = np.arange(num_pts)

    # Scatter each worker's output segment into the destination arrays.
    start_ind = 0
    for data_segment in data_generator:
      n_in_segment = data_segment[0].shape[0]
      dest = index_destinations[start_ind:start_ind + n_in_segment]
      start_ind += n_in_segment
      for i in range(3):
        data[i][dest] = data_segment[i]

  # Exactly the padding slots should still hold the -1 sentinel.
  assert np.sum(data[0] == -1) == num_padding

  if is_training:
    if num_padding:
      # In order to have a full batch, randomly include points from earlier in
      # the batch.
      mlperf_helper.ncf_print(key=mlperf_helper.TAGS.INPUT_ORDER)
      pad_sample_indices = np.random.randint(
          low=0, high=num_pts, size=(num_padding,))
      dest = np.arange(start=start_ind, stop=start_ind + num_padding)
      start_ind += num_padding
      for i in range(3):
        data[i][dest] = data[i][pad_sample_indices]
  else:
    # For Evaluation, padding is all zeros. The evaluation input_fn knows how
    # to interpret and discard the zero padded entries.
    data[0][num_pts:] = 0

  # Check that no points were overlooked.
  assert not np.sum(data[0] == -1)

  if is_training:
    # The number of points is slightly larger than num_pts due to padding.
    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.INPUT_SIZE,
                            value=int(data[0].shape[0]))
    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.INPUT_BATCH_SIZE,
                            value=batch_size)
  else:
    # num_pts is logged instead of int(data[0].shape[0]), because the size
    # of the data vector includes zero pads which are ignored.
    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.EVAL_SIZE, value=num_pts)

  # Assign batch ids round-robin style across `num_readers` output files.
  batches_per_file = np.ceil(num_pts_with_padding / batch_size / num_readers)
  current_file_id = -1
  current_batch_id = -1
  batches_by_file = [[] for _ in range(num_readers)]

  while True:
    current_batch_id += 1
    if (current_batch_id % batches_per_file) == 0:
      current_file_id += 1
    start_ind = current_batch_id * batch_size
    end_ind = start_ind + batch_size
    if end_ind > num_pts_with_padding:
      if start_ind != num_pts_with_padding:
        raise ValueError("Batch padding does not line up")
      break
    batches_by_file[current_file_id].append(current_batch_id)

  # Drop shards which were not assigned batches
  batches_by_file = [i for i in batches_by_file if i]
  num_readers = len(batches_by_file)

  if is_training:
    # Empirically it is observed that placing the batch with repeated values at
    # the start rather than the end improves convergence.
    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.INPUT_ORDER)
    batches_by_file[0][0], batches_by_file[-1][-1] = \
      batches_by_file[-1][-1], batches_by_file[0][0]

  if is_training:
    template = rconst.TRAIN_RECORD_TEMPLATE
    record_dir = os.path.join(cache_paths.train_epoch_dir,
                              get_cycle_folder_name(train_cycle))
    tf.gfile.MakeDirs(record_dir)
  else:
    template = rconst.EVAL_RECORD_TEMPLATE
    record_dir = cache_paths.eval_data_subdir

  # Write one TFRecords file per reader; each record is one full batch.
  batch_count = 0
  for i in range(num_readers):
    fpath = os.path.join(record_dir, template.format(i))
    log_msg("Writing {}".format(fpath))
    with tf.python_io.TFRecordWriter(fpath) as writer:
      for j in batches_by_file[i]:
        start_ind = j * batch_size
        end_ind = start_ind + batch_size
        record_kwargs = dict(
            users=data[0][start_ind:end_ind],
            items=data[1][start_ind:end_ind],
        )

        if is_training:
          record_kwargs["labels"] = data[2][start_ind:end_ind]
        else:
          record_kwargs["dupe_mask"] = stat_utils.mask_duplicates(
              record_kwargs["items"].reshape(-1, num_neg + 1),
              axis=1).flatten().astype(np.int8)

        batch_bytes = _construct_record(**record_kwargs)

        writer.write(batch_bytes)
        batch_count += 1

  # We write to a temp file then atomically rename it to the final file, because
  # writing directly to the final file can cause the main process to read a
  # partially written JSON file.
  ready_file_temp = os.path.join(record_dir, rconst.READY_FILE_TEMP)
  with tf.gfile.Open(ready_file_temp, "w") as f:
    json.dump({
        "batch_size": batch_size,
        "batch_count": batch_count,
    }, f)
  ready_file = os.path.join(record_dir, rconst.READY_FILE)
  tf.gfile.Rename(ready_file_temp, ready_file)

  if is_training:
    log_msg("Cycle {} complete. Total time: {:.1f} seconds"
            .format(train_cycle, timeit.default_timer() - st))
  else:
    log_msg("Eval construction complete. Total time: {:.1f} seconds"
            .format(timeit.default_timer() - st))
def _generation_loop(num_workers,          # type: int
                     cache_paths,          # type: rconst.Paths
                     num_readers,          # type: int
                     num_neg,              # type: int
                     num_train_positives,  # type: int
                     num_items,            # type: int
                     num_users,            # type: int
                     epochs_per_cycle,     # type: int
                     num_cycles,           # type: int
                     train_batch_size,     # type: int
                     eval_batch_size,      # type: int
                     deterministic,        # type: bool
                     match_mlperf          # type: bool
                    ):
  # type: (...) -> None
  """Primary run loop for data file generation.

  Produces the first training cycle and the (single) eval dataset up front,
  then keeps producing training cycles, throttling so that at most
  rconst.CYCLES_TO_BUFFER unconsumed cycles exist at any time. Exits via
  sys.exit() if the main process stops consuming for rconst.TIMEOUT_SECONDS.
  """
  log_msg("Entering generation loop.")
  tf.gfile.MakeDirs(cache_paths.train_epoch_dir)
  tf.gfile.MakeDirs(cache_paths.eval_data_subdir)

  training_shards = [os.path.join(cache_paths.train_shard_subdir, i) for i in
                     tf.gfile.ListDirectory(cache_paths.train_shard_subdir)]

  shared_kwargs = dict(
      num_workers=multiprocessing.cpu_count(), cache_paths=cache_paths,
      num_readers=num_readers, num_items=num_items,
      training_shards=training_shards, deterministic=deterministic,
      match_mlperf=match_mlperf
  )

  # Training blocks on the creation of the first epoch, so the num_workers
  # limit is not respected for this invocation
  train_cycle = 0
  _construct_records(
      is_training=True, train_cycle=train_cycle, num_neg=num_neg,
      num_positives=num_train_positives, epochs_per_cycle=epochs_per_cycle,
      batch_size=train_batch_size, **shared_kwargs)

  # Construct evaluation set.
  shared_kwargs["num_workers"] = num_workers
  _construct_records(
      is_training=False, train_cycle=None, num_neg=rconst.NUM_EVAL_NEGATIVES,
      num_positives=num_users, epochs_per_cycle=1, batch_size=eval_batch_size,
      **shared_kwargs)

  wait_count = 0
  start_time = time.time()
  while train_cycle < num_cycles:
    # A cycle directory disappears once the trainer consumes it, so the
    # directory count is the number of buffered, unconsumed cycles.
    ready_epochs = tf.gfile.ListDirectory(cache_paths.train_epoch_dir)
    if len(ready_epochs) >= rconst.CYCLES_TO_BUFFER:
      wait_count += 1
      # Back off gradually: sleep longer the more times we have waited.
      sleep_time = max([0, wait_count * 5 - (time.time() - start_time)])
      time.sleep(sleep_time)

      if (wait_count % 10) == 0:
        log_msg("Waited {} times for data to be consumed."
                .format(wait_count))

      if time.time() - start_time > rconst.TIMEOUT_SECONDS:
        log_msg("Waited more than {} seconds. Concluding that this "
                "process is orphaned and exiting gracefully."
                .format(rconst.TIMEOUT_SECONDS))
        sys.exit()

      continue

    train_cycle += 1
    _construct_records(
        is_training=True, train_cycle=train_cycle, num_neg=num_neg,
        num_positives=num_train_positives, epochs_per_cycle=epochs_per_cycle,
        batch_size=train_batch_size, **shared_kwargs)

    wait_count = 0
    start_time = time.time()
    gc.collect()
def wait_for_path(fpath):
  """Block until `fpath` exists, exiting the process if the wait times out."""
  deadline = time.time() + rconst.TIMEOUT_SECONDS
  while not tf.gfile.Exists(fpath):
    if time.time() > deadline:
      log_msg("Waited more than {} seconds. Concluding that this "
              "process is orphaned and exiting gracefully."
              .format(rconst.TIMEOUT_SECONDS))
      sys.exit()
    time.sleep(1)
def _parse_flagfile(flagfile):
  """Populate absl flags from the flagfile written by the main process."""
  tf.logging.info(
      "Waiting for flagfile to appear at {}...".format(flagfile))
  wait_for_path(flagfile)
  tf.logging.info("flagfile found.")

  # The `flags` module reads the flagfile with the builtin `open`, which does
  # not understand remote filesystems (google cloud storage etc.), so stage a
  # local copy first and clean it up afterwards.
  _, local_copy = tempfile.mkstemp()
  tf.gfile.Copy(flagfile, local_copy, overwrite=True)
  flags.FLAGS([__file__, "--flagfile", local_copy])
  tf.gfile.Remove(local_copy)
def write_alive_file(cache_paths):
  """Create a sentinel file telling the parent that this process started."""
  wait_for_path(cache_paths.cache_root)

  log_msg("Signaling that I am alive.")
  with tf.gfile.Open(cache_paths.subproc_alive, "w") as f:
    f.write("Generation subproc has started.")

  def _remove_alive_file():
    # Best-effort cleanup on interpreter exit.
    try:
      tf.gfile.Remove(cache_paths.subproc_alive)
    except tf.errors.NotFoundError:
      # Main thread has already deleted the entire cache dir.
      pass

  atexit.register(_remove_alive_file)
def main(_):
  """Entry point for the async data generation subprocess."""
  # Note: The async process must execute the following two steps in the
  # following order BEFORE doing anything else:
  # 1) Write the alive file
  # 2) Wait for the flagfile to be written.
  global _log_file
  cache_paths = rconst.Paths(
      data_dir=flags.FLAGS.data_dir, cache_id=flags.FLAGS.cache_id)
  write_alive_file(cache_paths=cache_paths)

  flagfile = os.path.join(cache_paths.cache_root, rconst.FLAGFILE)
  _parse_flagfile(flagfile)

  redirect_logs = flags.FLAGS.redirect_logs

  log_file_name = "data_gen_proc_{}.log".format(cache_paths.cache_id)
  log_path = os.path.join(cache_paths.data_dir, log_file_name)
  # The log file is opened with the builtin `open`, which cannot write to GCS;
  # fall back to a local temp directory in that case.
  if log_path.startswith("gs://") and redirect_logs:
    fallback_log_file = os.path.join(tempfile.gettempdir(), log_file_name)
    print("Unable to log to {}. Falling back to {}"
          .format(log_path, fallback_log_file))
    log_path = fallback_log_file

  # This server is generally run in a subprocess.
  if redirect_logs:
    print("Redirecting output of data_async_generation.py process to {}"
          .format(log_path))
    _log_file = open(log_path, "wt")  # Note: not tf.gfile.Open().
  try:
    log_msg("sys.argv: {}".format(" ".join(sys.argv)))

    if flags.FLAGS.seed is not None:
      np.random.seed(flags.FLAGS.seed)

    with mlperf_helper.LOGGER(
        enable=flags.FLAGS.output_ml_perf_compliance_logging):
      mlperf_helper.set_ncf_root(os.path.split(os.path.abspath(__file__))[0])
      _generation_loop(
          num_workers=flags.FLAGS.num_workers,
          cache_paths=cache_paths,
          num_readers=flags.FLAGS.num_readers,
          num_neg=flags.FLAGS.num_neg,
          num_train_positives=flags.FLAGS.num_train_positives,
          num_items=flags.FLAGS.num_items,
          num_users=flags.FLAGS.num_users,
          epochs_per_cycle=flags.FLAGS.epochs_per_cycle,
          num_cycles=flags.FLAGS.num_cycles,
          train_batch_size=flags.FLAGS.train_batch_size,
          eval_batch_size=flags.FLAGS.eval_batch_size,
          deterministic=flags.FLAGS.seed is not None,
          match_mlperf=flags.FLAGS.ml_perf,
      )
  except KeyboardInterrupt:
    log_msg("KeyboardInterrupt registered.")
  except:  # pylint: disable=bare-except
    # Record the traceback in the redirected log before re-raising, since
    # stderr may not be visible when running as a subprocess.
    traceback.print_exc(file=_log_file)
    raise
  finally:
    log_msg("Shutting down generation subprocess.")
    sys.stdout.flush()
    sys.stderr.flush()
    if redirect_logs:
      _log_file.close()
def define_flags():
  """Construct flags for the server.

  These flags are populated from the flagfile written by the main process
  (see _parse_flagfile), rather than from the command line.
  """
  flags.DEFINE_integer(name="num_workers", default=multiprocessing.cpu_count(),
                       help="Size of the negative generation worker pool.")
  flags.DEFINE_string(name="data_dir", default=None,
                      help="The data root. (used to construct cache paths.)")
  flags.DEFINE_string(name="cache_id", default=None,
                      help="The cache_id generated in the main process.")
  # Bug fix: the original implicit string concatenations below were missing a
  # trailing space ("This sets" + "how..." rendered as "setshow"), garbling
  # the --help output.
  flags.DEFINE_integer(name="num_readers", default=4,
                       help="Number of reader datasets in training. This sets "
                            "how the epoch files are sharded.")
  flags.DEFINE_integer(name="num_neg", default=None,
                       help="The Number of negative instances to pair with a "
                            "positive instance.")
  flags.DEFINE_integer(name="num_train_positives", default=None,
                       help="The number of positive training examples.")
  flags.DEFINE_integer(name="num_items", default=None,
                       help="Number of items from which to select negatives.")
  flags.DEFINE_integer(name="num_users", default=None,
                       help="The number of unique users. Used for evaluation.")
  flags.DEFINE_integer(name="epochs_per_cycle", default=1,
                       help="The number of epochs of training data to produce "
                            "at a time.")
  flags.DEFINE_integer(name="num_cycles", default=None,
                       help="The number of cycles to produce training data "
                            "for.")
  flags.DEFINE_integer(name="train_batch_size", default=None,
                       help="The batch size with which training TFRecords will "
                            "be chunked.")
  flags.DEFINE_integer(name="eval_batch_size", default=None,
                       help="The batch size with which evaluation TFRecords "
                            "will be chunked.")
  flags.DEFINE_boolean(name="redirect_logs", default=False,
                       help="Catch logs and write them to a file. "
                            "(Useful if this is run as a subprocess)")
  flags.DEFINE_boolean(name="use_tf_logging", default=False,
                       help="Use tf.logging instead of log file.")
  flags.DEFINE_integer(name="seed", default=None,
                       help="NumPy random seed to set at startup. If not "
                            "specified, a seed will not be set.")
  flags.DEFINE_boolean(name="ml_perf", default=None,
                       help="Match MLPerf. See ncf_main.py for details.")
  flags.DEFINE_bool(name="output_ml_perf_compliance_logging", default=None,
                    help="Output the MLPerf compliance logging. See "
                         "ncf_main.py for details.")

  flags.mark_flags_as_required(["data_dir", "cache_id"])
if __name__ == "__main__":
  # Register flags, then let absl parse argv and invoke main().
  define_flags()
  absl_app.run(main)
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Asynchronous data producer for the NCF pipeline."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import atexit
import functools
import os
import sys
import tempfile
import threading
import time
import timeit
import traceback
import typing
import numpy as np
import six
from six.moves import queue
import tensorflow as tf
from tensorflow.contrib.tpu.python.tpu.datasets import StreamingFilesDataset
from official.datasets import movielens
from official.recommendation import constants as rconst
from official.recommendation import popen_helper
from official.recommendation import stat_utils
# Template for a human-readable summary of the data producer's configuration.
# (Presumably used by the producer's __repr__ — verify against the rest of the
# file.)
SUMMARY_TEMPLATE = """General:
{spacer}Num users: {num_users}
{spacer}Num items: {num_items}
Training:
{spacer}Positive count: {train_pos_ct}
{spacer}Batch size: {train_batch_size} {multiplier}
{spacer}Batch count per epoch: {train_batch_ct}
Eval:
{spacer}Positive count: {eval_pos_ct}
{spacer}Batch size: {eval_batch_size} {multiplier}
{spacer}Batch count per epoch: {eval_batch_ct}"""


# Feature specs used to parse serialized examples. Every value is stored as a
# raw byte string and decoded with tf.decode_raw in
# DatasetManager._deserialize.
_TRAIN_FEATURE_MAP = {
    movielens.USER_COLUMN: tf.FixedLenFeature([], dtype=tf.string),
    movielens.ITEM_COLUMN: tf.FixedLenFeature([], dtype=tf.string),
    rconst.MASK_START_INDEX: tf.FixedLenFeature([1], dtype=tf.string),
    "labels": tf.FixedLenFeature([], dtype=tf.string),
}

_EVAL_FEATURE_MAP = {
    movielens.USER_COLUMN: tf.FixedLenFeature([], dtype=tf.string),
    movielens.ITEM_COLUMN: tf.FixedLenFeature([], dtype=tf.string),
    rconst.DUPLICATE_MASK: tf.FixedLenFeature([], dtype=tf.string)
}
class DatasetManager(object):
"""Helper class for handling TensorFlow specific data tasks.
This class takes the (relatively) framework agnostic work done by the data
constructor classes and handles the TensorFlow specific portions (TFRecord
management, tf.Dataset creation, etc.).
"""
  def __init__(self, is_training, stream_files, batches_per_epoch,
               shard_root=None, deterministic=False):
    # type: (bool, bool, int, typing.Optional[str], bool) -> None
    """Constructs a `DatasetManager` instance.

    Args:
      is_training: Boolean of whether the data provided is training or
        evaluation data. This determines whether to reuse the data
        (if is_training=False) and the exact structure to use when storing and
        yielding data.
      stream_files: Boolean indicating whether data should be serialized and
        written to file shards.
      batches_per_epoch: The number of batches in a single epoch.
      shard_root: The base directory to be used when stream_files=True.
      deterministic: Forgo non-deterministic speedups. (i.e. sloppy=True)
    """
    self._is_training = is_training
    self._deterministic = deterministic
    self._stream_files = stream_files
    # TFRecordWriters for the file shards; starts empty. put() indexes into
    # this list when streaming, so it must be populated before data arrives.
    self._writers = []
    # put() serializes writes per shard under these locks when streaming.
    self._write_locks = [threading.RLock() for _ in
                         range(rconst.NUM_FILE_SHARDS)] if stream_files else []
    self._batches_per_epoch = batches_per_epoch
    # Producer/consumer progress counters; buffer_reached() compares them.
    self._epochs_completed = 0
    self._epochs_requested = 0
    self._shard_root = shard_root

    # In-memory path (stream_files=False): results flow through this queue,
    # and eval results can be reused via _result_reuse.
    self._result_queue = queue.Queue()
    self._result_reuse = []
@property
def current_data_root(self):
subdir = (rconst.TRAIN_FOLDER_TEMPLATE.format(self._epochs_completed)
if self._is_training else rconst.EVAL_FOLDER)
return os.path.join(self._shard_root, subdir)
def buffer_reached(self):
# Only applicable for training.
return (self._epochs_completed - self._epochs_requested >=
rconst.CYCLES_TO_BUFFER and self._is_training)
@staticmethod
def _serialize(data):
"""Convert NumPy arrays into a TFRecords entry."""
feature_dict = {
k: tf.train.Feature(bytes_list=tf.train.BytesList(
value=[memoryview(v).tobytes()])) for k, v in data.items()}
return tf.train.Example(
features=tf.train.Features(feature=feature_dict)).SerializeToString()
def _deserialize(self, serialized_data, batch_size):
"""Convert serialized TFRecords into tensors.
Args:
serialized_data: A tensor containing serialized records.
batch_size: The data arrives pre-batched, so batch size is needed to
deserialize the data.
"""
feature_map = _TRAIN_FEATURE_MAP if self._is_training else _EVAL_FEATURE_MAP
features = tf.parse_single_example(serialized_data, feature_map)
users = tf.reshape(tf.decode_raw(
features[movielens.USER_COLUMN], rconst.USER_DTYPE), (batch_size,))
items = tf.reshape(tf.decode_raw(
features[movielens.ITEM_COLUMN], rconst.ITEM_DTYPE), (batch_size,))
def decode_binary(data_bytes):
# tf.decode_raw does not support bool as a decode type. As a result it is
# necessary to decode to int8 (7 of the bits will be ignored) and then
# cast to bool.
return tf.reshape(tf.cast(tf.decode_raw(data_bytes, tf.int8), tf.bool),
(batch_size,))
if self._is_training:
mask_start_index = tf.decode_raw(
features[rconst.MASK_START_INDEX], tf.int32)[0]
valid_point_mask = tf.less(tf.range(batch_size), mask_start_index)
return {
movielens.USER_COLUMN: users,
movielens.ITEM_COLUMN: items,
rconst.VALID_POINT_MASK: valid_point_mask,
}, decode_binary(features["labels"])
return {
movielens.USER_COLUMN: users,
movielens.ITEM_COLUMN: items,
rconst.DUPLICATE_MASK: decode_binary(features[rconst.DUPLICATE_MASK]),
}
def put(self, index, data):
# type: (int, dict) -> None
"""Store data for later consumption.
Because there are several paths for storing and yielding data (queues,
lists, files) the data producer simply provides the data in a standard
format at which point the dataset manager handles storing it in the correct
form.
Args:
index: Used to select shards when writing to files.
data: A dict of the data to be stored. This method mutates data, and
therefore expects to be the only consumer.
"""
if self._stream_files:
example_bytes = self._serialize(data)
with self._write_locks[index % rconst.NUM_FILE_SHARDS]:
self._writers[index % rconst.NUM_FILE_SHARDS].write(example_bytes)
else:
if self._is_training:
mask_start_index = data.pop(rconst.MASK_START_INDEX)
batch_size = data[movielens.ITEM_COLUMN].shape[0]
data[rconst.VALID_POINT_MASK] = np.less(np.arange(batch_size),
mask_start_index)
data = (data, data.pop("labels"))
self._result_queue.put(data)
def start_construction(self):
if self._stream_files:
tf.gfile.MakeDirs(self.current_data_root)
template = os.path.join(self.current_data_root, rconst.SHARD_TEMPLATE)
self._writers = [tf.io.TFRecordWriter(template.format(i))
for i in range(rconst.NUM_FILE_SHARDS)]
def end_construction(self):
if self._stream_files:
[writer.close() for writer in self._writers]
self._writers = []
self._result_queue.put(self.current_data_root)
self._epochs_completed += 1
def data_generator(self, epochs_between_evals):
"""Yields examples during local training."""
assert not self._stream_files
assert self._is_training or epochs_between_evals == 1
if self._is_training:
for _ in range(self._batches_per_epoch * epochs_between_evals):
yield self._result_queue.get(timeout=300)
else:
if self._result_reuse:
assert len(self._result_reuse) == self._batches_per_epoch
for i in self._result_reuse:
yield i
else:
# First epoch.
for _ in range(self._batches_per_epoch * epochs_between_evals):
result = self._result_queue.get(timeout=300)
self._result_reuse.append(result)
yield result
def get_dataset(self, batch_size, epochs_between_evals):
"""Construct the dataset to be used for training and eval.
For local training, data is provided through Dataset.from_generator. For
remote training (TPUs) the data is first serialized to files and then sent
to the TPU through a StreamingFilesDataset.
Args:
batch_size: The per-device batch size of the dataset.
epochs_between_evals: How many epochs worth of data to yield.
(Generator mode only.)
"""
self._epochs_requested += 1
if self._stream_files:
if epochs_between_evals > 1:
raise ValueError("epochs_between_evals > 1 not supported for file "
"based dataset.")
epoch_data_dir = self._result_queue.get(timeout=300)
if not self._is_training:
self._result_queue.put(epoch_data_dir) # Eval data is reused.
file_pattern = os.path.join(
epoch_data_dir, rconst.SHARD_TEMPLATE.format("*"))
dataset = StreamingFilesDataset(
files=file_pattern, worker_job="worker",
num_parallel_reads=rconst.NUM_FILE_SHARDS, num_epochs=1,
sloppy=not self._deterministic)
map_fn = functools.partial(self._deserialize, batch_size=batch_size)
dataset = dataset.map(map_fn, num_parallel_calls=16)
else:
types = {movielens.USER_COLUMN: rconst.USER_DTYPE,
movielens.ITEM_COLUMN: rconst.ITEM_DTYPE}
shapes = {movielens.USER_COLUMN: tf.TensorShape([batch_size]),
movielens.ITEM_COLUMN: tf.TensorShape([batch_size])}
if self._is_training:
types[rconst.VALID_POINT_MASK] = np.bool
shapes[rconst.VALID_POINT_MASK] = tf.TensorShape([batch_size])
types = (types, np.bool)
shapes = (shapes, tf.TensorShape([batch_size]))
else:
types[rconst.DUPLICATE_MASK] = np.bool
shapes[rconst.DUPLICATE_MASK] = tf.TensorShape([batch_size])
data_generator = functools.partial(
self.data_generator, epochs_between_evals=epochs_between_evals)
dataset = tf.data.Dataset.from_generator(
generator=data_generator, output_types=types,
output_shapes=shapes)
return dataset.prefetch(16)
def make_input_fn(self, batch_size):
"""Create an input_fn which checks for batch size consistency."""
def input_fn(params):
param_batch_size = (params["batch_size"] if self._is_training else
params["eval_batch_size"])
if batch_size != param_batch_size:
raise ValueError("producer batch size ({}) differs from params batch "
"size ({})".format(batch_size, param_batch_size))
epochs_between_evals = (params.get("epochs_between_evals", 1)
if self._is_training else 1)
return self.get_dataset(batch_size=batch_size,
epochs_between_evals=epochs_between_evals)
return input_fn
class BaseDataConstructor(threading.Thread):
  """Data constructor base class.

  This class manages the control flow for constructing data. It is not meant
  to be used directly, but instead subclasses should implement the following
  two methods:

    self.construct_lookup_variables
    self.lookup_negative_items
  """

  def __init__(self,
               maximum_number_epochs,   # type: int
               num_users,               # type: int
               num_items,               # type: int
               user_map,                # type: dict
               item_map,                # type: dict
               train_pos_users,         # type: np.ndarray
               train_pos_items,         # type: np.ndarray
               train_batch_size,        # type: int
               batches_per_train_step,  # type: int
               num_train_negatives,     # type: int
               eval_pos_users,          # type: np.ndarray
               eval_pos_items,          # type: np.ndarray
               eval_batch_size,         # type: int
               batches_per_eval_step,   # type: int
               stream_files,            # type: bool
               deterministic=False      # type: bool
              ):
    # General constants
    self._maximum_number_epochs = maximum_number_epochs
    self._num_users = num_users
    self._num_items = num_items
    self.user_map = user_map
    self.item_map = item_map
    self._train_pos_users = train_pos_users
    self._train_pos_items = train_pos_items
    self.train_batch_size = train_batch_size
    self._num_train_negatives = num_train_negatives
    self._batches_per_train_step = batches_per_train_step
    self._eval_pos_users = eval_pos_users
    self._eval_pos_items = eval_pos_items
    self.eval_batch_size = eval_batch_size

    # Training
    if self._train_pos_users.shape != self._train_pos_items.shape:
      raise ValueError(
          "User positives ({}) is different from item positives ({})".format(
              self._train_pos_users.shape, self._train_pos_items.shape))

    (self._train_pos_count,) = self._train_pos_users.shape
    self._elements_in_epoch = (1 + num_train_negatives) * self._train_pos_count
    self.train_batches_per_epoch = self._count_batches(
        self._elements_in_epoch, train_batch_size, batches_per_train_step)

    # Evaluation
    if eval_batch_size % (1 + rconst.NUM_EVAL_NEGATIVES):
      raise ValueError("Eval batch size {} is not divisible by {}".format(
          eval_batch_size, 1 + rconst.NUM_EVAL_NEGATIVES))
    self._eval_users_per_batch = int(
        eval_batch_size // (1 + rconst.NUM_EVAL_NEGATIVES))
    self._eval_elements_in_epoch = num_users * (1 + rconst.NUM_EVAL_NEGATIVES)
    self.eval_batches_per_epoch = self._count_batches(
        self._eval_elements_in_epoch, eval_batch_size, batches_per_eval_step)

    # Intermediate artifacts
    self._current_epoch_order = np.empty(shape=(0,))
    self._shuffle_iterator = None

    self._shuffle_with_forkpool = not stream_files
    if stream_files:
      self._shard_root = tempfile.mkdtemp(prefix="ncf_")
      atexit.register(tf.gfile.DeleteRecursively, dirname=self._shard_root)
    else:
      self._shard_root = None

    self._train_dataset = DatasetManager(
        True, stream_files, self.train_batches_per_epoch, self._shard_root,
        deterministic)
    self._eval_dataset = DatasetManager(
        False, stream_files, self.eval_batches_per_epoch, self._shard_root,
        deterministic)

    # Threading details
    super(BaseDataConstructor, self).__init__()
    self.daemon = True
    self._stop_loop = False
    self._fatal_exception = None
    self.deterministic = deterministic

  def __str__(self):
    multiplier = ("(x{} devices)".format(self._batches_per_train_step)
                  if self._batches_per_train_step > 1 else "")
    summary = SUMMARY_TEMPLATE.format(
        spacer="  ", num_users=self._num_users, num_items=self._num_items,
        train_pos_ct=self._train_pos_count,
        train_batch_size=self.train_batch_size,
        train_batch_ct=self.train_batches_per_epoch,
        eval_pos_ct=self._num_users, eval_batch_size=self.eval_batch_size,
        eval_batch_ct=self.eval_batches_per_epoch, multiplier=multiplier)
    return super(BaseDataConstructor, self).__str__() + "\n" + summary

  @staticmethod
  def _count_batches(example_count, batch_size, batches_per_step):
    """Determine the number of batches, rounding up to fill all devices."""
    x = (example_count + batch_size - 1) // batch_size
    return (x + batches_per_step - 1) // batches_per_step * batches_per_step

  def stop_loop(self):
    """Signal the production loop to exit at its next check point."""
    self._stop_loop = True

  def construct_lookup_variables(self):
    """Perform any one time pre-compute work."""
    raise NotImplementedError

  def lookup_negative_items(self, **kwargs):
    """Randomly sample negative items for given users."""
    raise NotImplementedError

  def _run(self):
    atexit.register(self.stop_loop)
    self._start_shuffle_iterator()
    self.construct_lookup_variables()
    # Eval data only needs to be constructed once; all subsequent epochs
    # produce training data only.
    self._construct_training_epoch()
    self._construct_eval_epoch()
    for _ in range(self._maximum_number_epochs - 1):
      self._construct_training_epoch()
    self.stop_loop()

  def run(self):
    try:
      self._run()
    except Exception as e:
      # The Thread base class swallows stack traces, so unfortunately it is
      # necessary to catch and re-raise to get debug output.
      traceback.print_exc()
      self._fatal_exception = e
      sys.stderr.flush()
      raise

  def _start_shuffle_iterator(self):
    """Kick off asynchronous generation of per-epoch shuffle permutations."""
    if self._shuffle_with_forkpool:
      pool = popen_helper.get_forkpool(3, closing=False)
    else:
      pool = popen_helper.get_threadpool(1, closing=False)
    atexit.register(pool.close)
    args = [(self._elements_in_epoch, stat_utils.random_int32())
            for _ in range(self._maximum_number_epochs)]
    imap = pool.imap if self.deterministic else pool.imap_unordered
    self._shuffle_iterator = imap(stat_utils.permutation, args)

  def _get_training_batch(self, i):
    """Construct a single batch of training data.

    Args:
      i: The index of the batch. This is used when stream_files=True to assign
        data to file shards.
    """
    batch_indices = self._current_epoch_order[i * self.train_batch_size:
                                              (i + 1) * self.train_batch_size]
    (mask_start_index,) = batch_indices.shape

    # Indices beyond _train_pos_count represent sampled negatives; reduce
    # modulo the positive count to recover which user each slot belongs to.
    batch_ind_mod = np.mod(batch_indices, self._train_pos_count)
    users = self._train_pos_users[batch_ind_mod]

    negative_indices = np.greater_equal(batch_indices, self._train_pos_count)
    negative_users = users[negative_indices]

    negative_items = self.lookup_negative_items(negative_users=negative_users)

    items = self._train_pos_items[batch_ind_mod]
    items[negative_indices] = negative_items

    labels = np.logical_not(negative_indices)

    # Pad last partial batch
    pad_length = self.train_batch_size - mask_start_index
    if pad_length:
      # We pad with arange rather than zeros because the network will still
      # compute logits for padded examples, and padding with zeros would create
      # a very "hot" embedding key which can have performance implications.
      user_pad = np.arange(pad_length, dtype=users.dtype) % self._num_users
      item_pad = np.arange(pad_length, dtype=items.dtype) % self._num_items
      label_pad = np.zeros(shape=(pad_length,), dtype=labels.dtype)
      users = np.concatenate([users, user_pad])
      items = np.concatenate([items, item_pad])
      labels = np.concatenate([labels, label_pad])

    self._train_dataset.put(i, {
        movielens.USER_COLUMN: users,
        movielens.ITEM_COLUMN: items,
        rconst.MASK_START_INDEX: np.array(mask_start_index, dtype=np.int32),
        "labels": labels,
    })

  def _wait_to_construct_train_epoch(self):
    """Block (with logarithmically spaced logging) until the buffer drains."""
    count = 0
    while self._train_dataset.buffer_reached() and not self._stop_loop:
      time.sleep(0.01)
      count += 1
      # Log at 100, 1000, 10000, ... iterations to avoid log spam.
      if count >= 100 and np.log10(count) == np.round(np.log10(count)):
        tf.logging.info(
            "Waited {} times for training data to be consumed".format(count))

  def _construct_training_epoch(self):
    """Loop to construct a batch of training data."""
    self._wait_to_construct_train_epoch()
    start_time = timeit.default_timer()
    if self._stop_loop:
      return

    self._train_dataset.start_construction()
    map_args = list(range(self.train_batches_per_epoch))
    self._current_epoch_order = next(self._shuffle_iterator)

    get_pool = (popen_helper.get_fauxpool if self.deterministic else
                popen_helper.get_threadpool)
    with get_pool(6) as pool:
      pool.map(self._get_training_batch, map_args)
    self._train_dataset.end_construction()

    tf.logging.info("Epoch construction complete. Time: {:.1f} seconds".format(
        timeit.default_timer() - start_time))

  @staticmethod
  def _assemble_eval_batch(users, positive_items, negative_items,
                           users_per_batch):
    """Construct duplicate_mask and structure data accordingly.

    The positive items should be last so that they lose ties. However, they
    should not be masked out if the true eval positive happens to be
    selected as a negative. So instead, the positive is placed in the first
    position, and then switched with the last element after the duplicate
    mask has been computed.

    Args:
      users: An array of users in a batch. (should be identical along axis 1)
      positive_items: An array (batch_size x 1) of positive item indices.
      negative_items: An array of negative item indices.
      users_per_batch: How many users should be in the batch. This is passed
        as an argument so that ncf_test.py can use this method.

    Returns:
      User, item, and duplicate_mask arrays.
    """
    items = np.concatenate([positive_items, negative_items], axis=1)

    # We pad the users and items here so that the duplicate mask calculation
    # will include padding. The metric function relies on all padded elements
    # except the positive being marked as duplicate to mask out padded points.
    if users.shape[0] < users_per_batch:
      pad_rows = users_per_batch - users.shape[0]
      padding = np.zeros(shape=(pad_rows, users.shape[1]), dtype=np.int32)
      users = np.concatenate([users, padding.astype(users.dtype)], axis=0)
      items = np.concatenate([items, padding.astype(items.dtype)], axis=0)

    # np.bool_ replaces the np.bool alias, which was removed in NumPy 1.24.
    duplicate_mask = stat_utils.mask_duplicates(items, axis=1).astype(np.bool_)

    items[:, (0, -1)] = items[:, (-1, 0)]
    duplicate_mask[:, (0, -1)] = duplicate_mask[:, (-1, 0)]

    assert users.shape == items.shape == duplicate_mask.shape
    return users, items, duplicate_mask

  def _get_eval_batch(self, i):
    """Construct a single batch of evaluation data.

    Args:
      i: The index of the batch.
    """
    low_index = i * self._eval_users_per_batch
    high_index = (i + 1) * self._eval_users_per_batch
    users = np.repeat(self._eval_pos_users[low_index:high_index, np.newaxis],
                      1 + rconst.NUM_EVAL_NEGATIVES, axis=1)
    positive_items = self._eval_pos_items[low_index:high_index, np.newaxis]
    negative_items = (self.lookup_negative_items(negative_users=users[:, :-1])
                      .reshape(-1, rconst.NUM_EVAL_NEGATIVES))

    users, items, duplicate_mask = self._assemble_eval_batch(
        users, positive_items, negative_items, self._eval_users_per_batch)

    self._eval_dataset.put(i, {
        movielens.USER_COLUMN: users.flatten(),
        movielens.ITEM_COLUMN: items.flatten(),
        rconst.DUPLICATE_MASK: duplicate_mask.flatten(),
    })

  def _construct_eval_epoch(self):
    """Loop to construct data for evaluation."""
    if self._stop_loop:
      return

    start_time = timeit.default_timer()

    self._eval_dataset.start_construction()
    map_args = list(range(self.eval_batches_per_epoch))

    get_pool = (popen_helper.get_fauxpool if self.deterministic else
                popen_helper.get_threadpool)
    with get_pool(6) as pool:
      pool.map(self._get_eval_batch, map_args)
    self._eval_dataset.end_construction()

    tf.logging.info("Eval construction complete. Time: {:.1f} seconds".format(
        timeit.default_timer() - start_time))

  def make_input_fn(self, is_training):
    """Return the train or eval input_fn, surfacing any producer failure."""
    # It isn't feasible to provide a foolproof check, so this is designed to
    # catch most failures rather than provide an exhaustive guard.
    if self._fatal_exception is not None:
      raise ValueError("Fatal exception in the data production loop: {}"
                       .format(self._fatal_exception))

    return (
        self._train_dataset.make_input_fn(self.train_batch_size) if is_training
        else self._eval_dataset.make_input_fn(self.eval_batch_size))
class DummyConstructor(threading.Thread):
  """Class for running with synthetic data."""

  def run(self):
    pass

  def stop_loop(self):
    pass

  @staticmethod
  def make_input_fn(is_training):
    """Construct training input_fn that uses synthetic data."""

    def input_fn(params):
      """Generated input_fn for the given epoch."""
      if is_training:
        batch_size = params["batch_size"]
      else:
        batch_size = params["eval_batch_size"]
      num_users = params["num_users"]
      num_items = params["num_items"]

      def random_ints(maxval):
        # Uniform int32 draws in [0, maxval).
        return tf.random_uniform([batch_size], dtype=tf.int32, minval=0,
                                 maxval=maxval)

      def random_bools():
        # Coin flips: {0, 1} ints cast to bool.
        return tf.cast(random_ints(2), tf.bool)

      users = random_ints(num_users)
      items = random_ints(num_items)

      if is_training:
        data = {
            movielens.USER_COLUMN: users,
            movielens.ITEM_COLUMN: items,
            rconst.VALID_POINT_MASK: random_bools(),
        }, random_bools()
      else:
        data = {
            movielens.USER_COLUMN: users,
            movielens.ITEM_COLUMN: items,
            rconst.DUPLICATE_MASK: random_bools(),
        }

      dataset = tf.data.Dataset.from_tensors(data).repeat(
          rconst.SYNTHETIC_BATCHES_PER_EPOCH * params["batches_per_step"])
      return dataset.prefetch(32)

    return input_fn
class MaterializedDataConstructor(BaseDataConstructor):
  """Materialize a table of negative examples for fast negative generation.

  This class creates a table (num_users x num_items) containing all of the
  negative examples for each user. This table is conceptually ragged; that is
  to say the items dimension will have a number of unused elements at the end
  equal to the number of positive elements for a given user. For instance:

    num_users = 3
    num_items = 5
    positives = [[1, 3], [0], [1, 2, 3, 4]]

  will generate a negative table:

    [
      [0         2         4         int32max  int32max],
      [1         2         3         4         int32max],
      [0         int32max  int32max  int32max  int32max],
    ]

  and a vector of per-user negative counts, which in this case would be:

    [3, 4, 1]

  When sampling negatives, integers are (nearly) uniformly selected from the
  range [0, per_user_neg_count[user]) which gives a column_index, at which
  point the negative can be selected as:

    negative_table[user, column_index]

  This technique will not scale; however MovieLens is small enough that even
  a pre-compute which is quadratic in problem size will still fit in memory. A
  more scalable lookup method is in the works.
  """

  def __init__(self, *args, **kwargs):
    super(MaterializedDataConstructor, self).__init__(*args, **kwargs)
    # Both are filled in by construct_lookup_variables().
    self._negative_table = None
    self._per_user_neg_count = None

  def construct_lookup_variables(self):
    """Materialize negatives for fast lookup sampling."""
    start_time = timeit.default_timer()

    # Boundaries between users in the user-sorted positive example arrays.
    inner_bounds = np.argwhere(self._train_pos_users[1:] -
                               self._train_pos_users[:-1])[:, 0] + 1
    (upper_bound,) = self._train_pos_users.shape
    index_bounds = [0] + inner_bounds.tolist() + [upper_bound]

    self._negative_table = np.zeros(shape=(self._num_users, self._num_items),
                                    dtype=rconst.ITEM_DTYPE)

    # Set the table to the max value to make sure the embedding lookup will
    # fail if we go out of bounds, rather than just overloading item zero.
    self._negative_table += np.iinfo(rconst.ITEM_DTYPE).max
    assert self._num_items < np.iinfo(rconst.ITEM_DTYPE).max

    # Reuse arange during generation. np.delete will make a copy.
    full_set = np.arange(self._num_items, dtype=rconst.ITEM_DTYPE)

    self._per_user_neg_count = np.zeros(
        shape=(self._num_users,), dtype=np.int32)

    # Threading does not improve this loop. For some reason, the np.delete
    # call does not parallelize well. Multiprocessing incurs too much
    # serialization overhead to be worthwhile.
    for user in range(self._num_users):
      user_positives = self._train_pos_items[index_bounds[user]:
                                             index_bounds[user + 1]]
      user_negatives = np.delete(full_set, user_positives)
      self._per_user_neg_count[user] = (
          self._num_items - user_positives.shape[0])
      self._negative_table[
          user, :self._per_user_neg_count[user]] = user_negatives

    tf.logging.info("Negative sample table built. Time: {:.1f} seconds".format(
        timeit.default_timer() - start_time))

  def lookup_negative_items(self, negative_users, **kwargs):
    """Draw a (nearly) uniform negative item for each user in the input."""
    column_choice = stat_utils.very_slightly_biased_randint(
        self._per_user_neg_count[negative_users])
    return self._negative_table[negative_users, column_choice]
class BisectionDataConstructor(BaseDataConstructor):
  """Use bisection to index within positive examples.

  This class tallies the number of negative items which appear before each
  positive item for a user. This means that in order to select the ith negative
  item for a user, it only needs to determine which two positive items bound
  it at which point the item id for the ith negative is a simply algebraic
  expression.
  """

  def __init__(self, *args, **kwargs):
    super(BisectionDataConstructor, self).__init__(*args, **kwargs)
    # Boundaries of each user's slice of the user-sorted positive arrays.
    # All three attributes are populated by construct_lookup_variables().
    self.index_bounds = None
    # Copy of _train_pos_items with each user's items sorted ascending.
    self._sorted_train_pos_items = None
    # For each positive, the cumulative count of negative item ids that
    # precede it within that user's item range.
    self._total_negatives = None

  def _index_segment(self, user):
    """Compute cumulative negative tallies for one user's sorted positives."""
    lower, upper = self.index_bounds[user:user+2]
    items = self._sorted_train_pos_items[lower:upper]

    # Count of negative item ids falling strictly between consecutive
    # positives (and before the first positive).
    negatives_since_last_positive = np.concatenate(
        [items[0][np.newaxis], items[1:] - items[:-1] - 1])

    return np.cumsum(negatives_since_last_positive)

  def construct_lookup_variables(self):
    """Build the per-user sorted positives and negative tally vectors."""
    start_time = timeit.default_timer()
    inner_bounds = np.argwhere(self._train_pos_users[1:] -
                               self._train_pos_users[:-1])[:, 0] + 1
    (upper_bound,) = self._train_pos_users.shape
    self.index_bounds = np.array([0] + inner_bounds.tolist() + [upper_bound])

    # Later logic will assume that the users are in sequential ascending order.
    assert np.array_equal(self._train_pos_users[self.index_bounds[:-1]],
                          np.arange(self._num_users))

    self._sorted_train_pos_items = self._train_pos_items.copy()

    # Sort each user's slice in place.
    for i in range(self._num_users):
      lower, upper = self.index_bounds[i:i+2]
      self._sorted_train_pos_items[lower:upper].sort()

    self._total_negatives = np.concatenate([
        self._index_segment(i) for i in range(self._num_users)])

    tf.logging.info("Negative total vector built. Time: {:.1f} seconds".format(
        timeit.default_timer() - start_time))

  def lookup_negative_items(self, negative_users, **kwargs):
    """Vectorized bisection lookup of one negative item per input user.

    Args:
      negative_users: Array of user ids for which to sample negatives.

    Returns:
      An array (same shape as negative_users) of negative item ids.
    """
    # -1 sentinel: the final assert verifies every slot was filled.
    output = np.zeros(shape=negative_users.shape, dtype=rconst.ITEM_DTYPE) - 1

    # Per-user bounds into the sorted positive item array.
    left_index = self.index_bounds[negative_users]
    right_index = self.index_bounds[negative_users + 1] - 1

    num_positives = right_index - left_index + 1
    num_negatives = self._num_items - num_positives
    neg_item_choice = stat_utils.very_slightly_biased_randint(num_negatives)

    # Shortcuts:
    # For points where the negative is greater than or equal to the tally before
    # the last positive point there is no need to bisect. Instead the item id
    # corresponding to the negative item choice is simply:
    #   last_postive_index + 1 + (neg_choice - last_negative_tally)
    # Similarly, if the selection is less than the tally at the first positive
    # then the item_id is simply the selection.
    #
    # Because MovieLens organizes popular movies into low integers (which is
    # preserved through the preprocessing), the first shortcut is very
    # efficient, allowing ~60% of samples to bypass the bisection. For the same
    # reason, the second shortcut is rarely triggered (<0.02%) and is therefore
    # not worth implementing.
    use_shortcut = neg_item_choice >= self._total_negatives[right_index]
    output[use_shortcut] = (
        self._sorted_train_pos_items[right_index] + 1 +
        (neg_item_choice - self._total_negatives[right_index])
    )[use_shortcut]

    if np.all(use_shortcut):
      # The bisection code is ill-posed when there are no elements.
      return output

    # Restrict the remaining work to points that did not take the shortcut.
    not_use_shortcut = np.logical_not(use_shortcut)
    left_index = left_index[not_use_shortcut]
    right_index = right_index[not_use_shortcut]
    neg_item_choice = neg_item_choice[not_use_shortcut]

    # Worst-case number of halvings needed to close every (left, right) pair.
    num_loops = np.max(
        np.ceil(np.log2(num_positives[not_use_shortcut])).astype(np.int32))

    for i in range(num_loops):
      mid_index = (left_index + right_index) // 2
      right_criteria = self._total_negatives[mid_index] > neg_item_choice
      left_criteria = np.logical_not(right_criteria)

      right_index[right_criteria] = mid_index[right_criteria]
      left_index[left_criteria] = mid_index[left_criteria]

    # Expected state after bisection pass:
    #   The right index is the smallest index whose tally is greater than the
    #   negative item choice index.

    assert np.all((right_index - left_index) <= 1)

    output[not_use_shortcut] = (
        self._sorted_train_pos_items[right_index] -
        (self._total_negatives[right_index] - neg_item_choice)
    )

    assert np.all(output >= 0)

    return output
def get_constructor(name):
  """Look up the data constructor class registered under `name`."""
  if name == "materialized":
    return MaterializedDataConstructor
  if name == "bisection":
    return BisectionDataConstructor
  raise ValueError("Unrecognized constructor: {}".format(name))
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import atexit
import contextlib
import gc
import hashlib
import multiprocessing
import json
import os
import pickle
import signal
import socket
import subprocess
import time
import timeit
import typing
# pylint: disable=wrong-import-order
from absl import app as absl_app
from absl import flags
import numpy as np
import pandas as pd
import six
import tensorflow as tf
# pylint: enable=wrong-import-order
from official.datasets import movielens
from official.recommendation import constants as rconst
from official.recommendation import stat_utils
from official.recommendation import popen_helper
from official.recommendation import data_pipeline
from official.utils.logs import mlperf_helper
DATASET_TO_NUM_USERS_AND_ITEMS = {
}
# Number of batches to run per epoch when using synthetic data. At high batch
# sizes, we run for more batches than with real data, which is good since
# running more batches reduces noise when measuring the average batches/second.
SYNTHETIC_BATCHES_PER_EPOCH = 2000

# NOTE(review): appears to enumerate the keys a preprocessed-data cache must
# contain to be considered usable — verify against the cache reader/writer.
_EXPECTED_CACHE_KEYS = (
    rconst.TRAIN_USER_KEY, rconst.TRAIN_ITEM_KEY, rconst.EVAL_USER_KEY,
    rconst.EVAL_ITEM_KEY, rconst.USER_MAP, rconst.ITEM_MAP)
class NCFDataset(object):
  """Container for training and testing data."""

  def __init__(self, user_map, item_map, num_data_readers, cache_paths,
               num_train_positives, deterministic=False):
    # type: (dict, dict, int, rconst.Paths, int, bool) -> None
    """Assign key values for recommendation dataset.

    Args:
      user_map: Dict mapping raw user ids to regularized ids.
      item_map: Dict mapping raw item ids to regularized ids.
      num_data_readers: The number of reader Datasets used during training.
      cache_paths: Object containing locations for various cache files.
      num_train_positives: The number of positive training examples in the
        dataset.
      deterministic: Operations should use deterministic, order preserving
        methods, even at the cost of performance.
    """
    # Coerce map keys/values to plain ints; they may arrive as numpy or
    # JSON-decoded scalar types.
    self.user_map = {int(raw): int(mapped)
                     for raw, mapped in user_map.items()}
    self.item_map = {int(raw): int(mapped)
                     for raw, mapped in item_map.items()}
    self.num_users = len(user_map)
    self.num_items = len(item_map)
    self.num_data_readers = num_data_readers
    self.cache_paths = cache_paths
    self.num_train_positives = num_train_positives
    self.deterministic = deterministic
def _filter_index_sort(raw_rating_path, match_mlperf):
  # type: (str, bool) -> (pd.DataFrame, dict, dict)
  """Read in data CSV, and output structured data.

  This function reads in the raw CSV of positive items, and performs three
  preprocessing transformations:

  1)  Filter out users with fewer than MIN_NUM_RATINGS ratings.

  2)  Zero-index remap the user and item ids so that each forms a dense
      [0, n) range.

  3)  Sort the dataframe by user, with timestamp as a secondary sort key, so
      that a user's most recent rating is the last row of that user's slice.

  Note: an earlier revision of this file contained two conflicting
  signatures for this function (diff residue); this is the one matching the
  implemented body.

  Args:
    raw_rating_path: The path to the CSV which contains the raw dataset.
    match_mlperf: If True, change the sorting algorithm to match the MLPerf
      reference implementation.

  Returns:
    A filtered, zero-index remapped, sorted dataframe, a dict mapping raw user
    IDs to regularized user IDs, and a dict mapping raw item IDs to regularized
    item IDs.
  """
  with tf.gfile.Open(raw_rating_path) as f:
    df = pd.read_csv(f)

  # Get the info of users who have more than 20 ratings on items
  grouped = df.groupby(movielens.USER_COLUMN)
  df = grouped.filter(
      lambda x: len(x) >= rconst.MIN_NUM_RATINGS)  # type: pd.DataFrame

  original_users = df[movielens.USER_COLUMN].unique()
  original_items = df[movielens.ITEM_COLUMN].unique()

  mlperf_helper.ncf_print(key=mlperf_helper.TAGS.PREPROC_HP_MIN_RATINGS,
                          value=rconst.MIN_NUM_RATINGS)

  # Map the ids of user and item to 0 based index for following processing
  tf.logging.info("Generating user_map and item_map...")
  user_map = {user: index for index, user in enumerate(original_users)}
  item_map = {item: index for index, item in enumerate(original_items)}

  df[movielens.USER_COLUMN] = df[movielens.USER_COLUMN].apply(
      lambda user: user_map[user])
  df[movielens.ITEM_COLUMN] = df[movielens.ITEM_COLUMN].apply(
      lambda item: item_map[item])

  num_users = len(original_users)
  num_items = len(original_items)

  mlperf_helper.ncf_print(key=mlperf_helper.TAGS.PREPROC_HP_NUM_EVAL,
                          value=rconst.NUM_EVAL_NEGATIVES)
  mlperf_helper.ncf_print(
      key=mlperf_helper.TAGS.PREPROC_HP_SAMPLE_EVAL_REPLACEMENT,
      value=match_mlperf)

  # User ids are stored as int32 and item ids as uint16 downstream; guard
  # against overflow, and confirm the remap produced dense ranges.
  assert num_users <= np.iinfo(np.int32).max
  assert num_items <= np.iinfo(np.uint16).max
  assert df[movielens.USER_COLUMN].max() == num_users - 1
  assert df[movielens.ITEM_COLUMN].max() == num_items - 1

  # This sort is used to shard the dataframe by user, and later to select
  # the last item for a user to be used in validation.
  tf.logging.info("Sorting by user, timestamp...")
  if match_mlperf:
    # This sort is equivalent to the non-MLPerf sort, except that the order of
    # items with the same user and timestamp are sometimes different. For some
    # reason, this sort results in a better hit-rate during evaluation,
    # matching the performance of the MLPerf reference implementation.
    df.sort_values(by=movielens.TIMESTAMP_COLUMN, inplace=True)
    df.sort_values([movielens.USER_COLUMN, movielens.TIMESTAMP_COLUMN],
                   inplace=True, kind="mergesort")
  else:
    df.sort_values([movielens.USER_COLUMN, movielens.TIMESTAMP_COLUMN],
                   inplace=True)

  df = df.reset_index()  # The dataframe does not reconstruct indices in the
                         # sort or filter steps.

  return df, user_map, item_map
def _train_eval_map_fn(args):
  """Split training and testing data and generate testing negatives.

  This function is called as part of a multiprocessing map. The principle
  input is a shard, which contains a sorted array of users and corresponding
  items for each user, where items have already been sorted in ascending order
  by timestamp. (Timestamp is not passed to avoid the serialization cost of
  sending it to the map function.)

  For each user, all but the last item is written into a pickle file which the
  training data producer can consume on as needed. The last item for a user
  is a validation point; it is written under a separate key and will be used
  later to generate the evaluation data.

  Args:
    args: A tuple of (shard, shard_id, num_items, cache_paths):
      shard: A dict containing the user and item arrays.
      shard_id: The id of the shard provided. This is used to number the
        training shard pickle files.
      num_items: The cardinality of the item set, which determines the set
        from which validation negatives should be drawn.
      cache_paths: rconst.Paths object containing locations for various
        cache files.
  """
  shard, shard_id, num_items, cache_paths = args

  users = shard[movielens.USER_COLUMN]
  items = shard[movielens.ITEM_COLUMN]

  # Indices where the user id changes. Together with the two endpoints these
  # produce slice boundaries which partition the shard by user.
  change_points = np.flatnonzero(users[1:] - users[:-1]) + 1
  boundaries = [0] + change_points.tolist() + [users.shape[0]]

  train_user_blocks = []
  train_item_blocks = []
  eval_positives = []

  for start, end in zip(boundaries[:-1], boundaries[1:]):
    # Each slice is a run of a single repeated user id. Flat parallel vectors
    # are used (rather than a nested ((user_id, items), ...) representation)
    # because the extra nesting significantly increases serialization and
    # deserialization cost.
    user_block = users[start:end]
    assert len(set(user_block)) == 1
    item_block = items[start:end]

    # All but the most recent item are training points; the final item is the
    # held-out validation positive for this user.
    train_user_blocks.append(user_block[:-1])
    train_item_blocks.append(item_block[:-1])
    eval_positives.append((user_block[0], item_block[-1]))

  train_users = np.concatenate(train_user_blocks)
  train_items = np.concatenate(train_item_blocks)
  test_pos_users = np.array([pair[0] for pair in eval_positives],
                            dtype=train_users.dtype)
  test_pos_items = np.array([pair[1] for pair in eval_positives],
                            dtype=train_items.dtype)

  train_shard_fpath = cache_paths.train_shard_template.format(
      str(shard_id).zfill(5))

  with tf.gfile.Open(train_shard_fpath, "wb") as f:
    pickle.dump({
        rconst.TRAIN_KEY: {
            movielens.USER_COLUMN: train_users,
            movielens.ITEM_COLUMN: train_items,
        },
        rconst.EVAL_KEY: {
            movielens.USER_COLUMN: test_pos_users,
            movielens.ITEM_COLUMN: test_pos_items,
        }
    }, f)
def generate_train_eval_data(df, approx_num_shards, num_items, cache_paths,
                             match_mlperf):
  # type: (pd.DataFrame, int, int, rconst.Paths, bool) -> None
  """Construct training and evaluation datasets.

  This function manages dataset construction and validation that the
  transformations have produced correct results. The particular logic of
  transforming the data is performed in _train_eval_map_fn().

  Args:
    df: The dataframe containing the entire dataset. It is essential that this
      dataframe be produced by _filter_index_sort(), as subsequent
      transformations rely on `df` having particular structure.
    approx_num_shards: The approximate number of similarly sized shards to
      construct from `df`. The MovieLens has severe imbalances where some users
      have interacted with many items; this is common among datasets involving
      user data. Rather than attempt to aggressively balance shard size, this
      function simply allows shards to "overflow" which can produce a number of
      shards which is less than `approx_num_shards`. This small degree of
      imbalance does not impact performance; however it does mean that one
      should not expect approx_num_shards to be the ACTUAL number of shards.
    num_items: The cardinality of the item set.
    cache_paths: rconst.Paths object containing locations for various cache
      files.
    match_mlperf: If True, sample eval negative with replacements, which the
      MLPerf reference implementation does.
  """
  # NOTE(review): this body references names that are neither parameters nor
  # locals (`cache_path`, `cached_data`, `raw_rating_path`,
  # `_EXPECTED_CACHE_KEYS`). It appears that two revisions of the pipeline
  # (shard generation and raw-cache validation) were merged together here;
  # reconcile against version control before relying on this function.
  valid_cache = tf.gfile.Exists(cache_path)
  if valid_cache:
    with tf.gfile.Open(cache_path, "rb") as f:
      cached_data = pickle.load(f)
  num_rows = len(df)
  # Evenly spaced row indices; each boundary is later pushed forward so that
  # a single user's rows never straddle two shards.
  approximate_partitions = np.linspace(
      0, num_rows, approx_num_shards + 1).astype("int")
  start_ind, end_ind = 0, 0
  shards = []

  # NOTE(review): `cached_data` is only bound when `valid_cache` is True, so
  # this statement raises NameError otherwise — presumably diff residue.
  cache_age = time.time() - cached_data.get("create_time", 0)
  if cache_age > rconst.CACHE_INVALIDATION_SEC:
    valid_cache = False

  for i in range(1, approx_num_shards + 1):
    end_ind = approximate_partitions[i]
    # Push the shard boundary forward while the user id is unchanged, keeping
    # each user's rows together in one shard.
    while (end_ind < num_rows and df[movielens.USER_COLUMN][end_ind - 1] ==
           df[movielens.USER_COLUMN][end_ind]):
      end_ind += 1

    # NOTE(review): this check is loop-invariant (it does not use `i`) and
    # looks spliced in from the cache-validation code path.
    for key in _EXPECTED_CACHE_KEYS:
      if key not in cached_data:
        valid_cache = False

    if end_ind <= start_ind:
      continue  # imbalance from prior shard.

    df_shard = df[start_ind:end_ind]
    user_shard = df_shard[movielens.USER_COLUMN].values.astype(np.int32)
    item_shard = df_shard[movielens.ITEM_COLUMN].values.astype(np.uint16)

    shards.append({
        movielens.USER_COLUMN: user_shard,
        movielens.ITEM_COLUMN: item_shard,
    })

    start_ind = end_ind
  assert end_ind == num_rows
  approx_num_shards = len(shards)  # Use the actual (post-overflow) count.

  tf.logging.info("Splitting train and test data and generating {} test "
                  "negatives per user...".format(rconst.NUM_EVAL_NEGATIVES))
  tf.gfile.MakeDirs(cache_paths.train_shard_subdir)
  map_args = [(shards[i], i, num_items, cache_paths)
              for i in range(approx_num_shards)]
  with popen_helper.get_pool(multiprocessing.cpu_count()) as pool:
    pool.map(_train_eval_map_fn, map_args)  # pylint: disable=no-member
  if not valid_cache:
    tf.logging.info("Removing stale raw data cache file.")
    tf.gfile.Remove(cache_path)
  if valid_cache:
    data = cached_data
  else:
    with tf.gfile.Open(raw_rating_path) as f:
      df = pd.read_csv(f)
    # Get the info of users who have more than 20 ratings on items
    grouped = df.groupby(movielens.USER_COLUMN)
    df = grouped.filter(
        lambda x: len(x) >= rconst.MIN_NUM_RATINGS)  # type: pd.DataFrame
    original_users = df[movielens.USER_COLUMN].unique()
    original_items = df[movielens.ITEM_COLUMN].unique()

    # Map the ids of user and item to 0 based index for following processing
    tf.logging.info("Generating user_map and item_map...")
    user_map = {user: index for index, user in enumerate(original_users)}
    item_map = {item: index for index, item in enumerate(original_items)}

    df[movielens.USER_COLUMN] = df[movielens.USER_COLUMN].apply(
        lambda user: user_map[user])
    df[movielens.ITEM_COLUMN] = df[movielens.ITEM_COLUMN].apply(
        lambda item: item_map[item])

    num_users = len(original_users)
    num_items = len(original_items)
    mlperf_helper.ncf_print(key=mlperf_helper.TAGS.PREPROC_HP_NUM_EVAL,
                            value=rconst.NUM_EVAL_NEGATIVES)
    # Ids must fit in the compact dtypes used downstream.
    assert num_users <= np.iinfo(rconst.USER_DTYPE).max
    assert num_items <= np.iinfo(rconst.ITEM_DTYPE).max
    assert df[movielens.USER_COLUMN].max() == num_users - 1
    assert df[movielens.ITEM_COLUMN].max() == num_items - 1

    # This sort is used to shard the dataframe by user, and later to select
    # the last item for a user to be used in validation.
    tf.logging.info("Sorting by user, timestamp...")

    # This sort is equivalent to
    #   df.sort_values([movielens.USER_COLUMN, movielens.TIMESTAMP_COLUMN],
    #                  inplace=True)
    # except that the order of items with the same user and timestamp are
    # sometimes different. For some reason, this sort results in a better
    # hit-rate during evaluation, matching the performance of the MLPerf
    # reference implementation.
    df.sort_values(by=movielens.TIMESTAMP_COLUMN, inplace=True)
    df.sort_values([movielens.USER_COLUMN, movielens.TIMESTAMP_COLUMN],
                   inplace=True, kind="mergesort")
def construct_cache(dataset, data_dir, num_data_readers, match_mlperf,
                    deterministic, cache_id=None):
  # type: (str, str, int, bool, bool, typing.Optional[int]) -> NCFDataset
  # NOTE(review): this body does not match its signature. It reads `df`,
  # `user_map`, `item_map`, `cache_path`, and `valid_cache` before any of them
  # is assigned, and returns `(data, valid_cache)` rather than the NCFDataset
  # promised by the type comment. It appears to be the tail of
  # _filter_index_sort() spliced under the old construct_cache() header by a
  # bad merge; reconcile against version control.
  df = df.reset_index()  # The dataframe does not reconstruct indices in the
                         # sort or filter steps.

  # The last (most recent) item per user is the eval positive; everything
  # before it is training data.
  grouped = df.groupby(movielens.USER_COLUMN, group_keys=False)
  eval_df, train_df = grouped.tail(1), grouped.apply(lambda x: x.iloc[:-1])

  data = {
      rconst.TRAIN_USER_KEY: train_df[movielens.USER_COLUMN]
                             .values.astype(rconst.USER_DTYPE),
      rconst.TRAIN_ITEM_KEY: train_df[movielens.ITEM_COLUMN]
                             .values.astype(rconst.ITEM_DTYPE),
      rconst.EVAL_USER_KEY: eval_df[movielens.USER_COLUMN]
                            .values.astype(rconst.USER_DTYPE),
      rconst.EVAL_ITEM_KEY: eval_df[movielens.ITEM_COLUMN]
                            .values.astype(rconst.ITEM_DTYPE),
      rconst.USER_MAP: user_map,
      rconst.ITEM_MAP: item_map,
      "create_time": time.time(),  # used for cache-age invalidation
  }

  tf.logging.info("Writing raw data cache.")
  with tf.gfile.Open(cache_path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
  # TODO(robieta): MLPerf cache clear.
  return data, valid_cache
def instantiate_pipeline(dataset, data_dir, params, constructor_type=None,
                         deterministic=False):
  # type: (str, str, dict, typing.Optional[str], bool) -> (NCFDataset, typing.Callable)
  """Load and digest data CSV into a usable form.

  Args:
    dataset: The name of the dataset to be used.
    data_dir: The root directory of the dataset.
    params: dict of parameters for the run.
    constructor_type: The name of the constructor subclass that should be used
      for the input pipeline.
    deterministic: Tell the data constructor to produce deterministically.
  """
  # NOTE(review): this body references names that are not parameters of this
  # function (`cache_id`, `num_data_readers`, `match_mlperf`), contains a
  # leaked diff hunk header, and ends with two merged logging calls. The old
  # and new versions of the pipeline appear interleaved here; reconcile
  # against version control before relying on this function.
  cache_paths = rconst.Paths(data_dir=data_dir, cache_id=cache_id)
  num_data_readers = (num_data_readers or int(multiprocessing.cpu_count() / 2)
                      or 1)
  approx_num_shards = int(movielens.NUM_RATINGS[dataset]
                          // rconst.APPROX_PTS_PER_TRAIN_SHARD) or 1

  tf.logging.info("Beginning data preprocessing.")

  st = timeit.default_timer()
  cache_root = os.path.join(data_dir, cache_paths.cache_root)
  if tf.gfile.Exists(cache_root):
    raise ValueError("{} unexpectedly already exists."
                     .format(cache_paths.cache_root))
  tf.logging.info("Creating cache directory. This should be deleted on exit.")
  tf.gfile.MakeDirs(cache_paths.cache_root)

  raw_rating_path = os.path.join(data_dir, dataset, movielens.RATINGS_FILE)
  df, user_map, item_map = _filter_index_sort(raw_rating_path, match_mlperf)
  cache_path = os.path.join(data_dir, dataset, rconst.RAW_CACHE_FILE)
  raw_data, _ = _filter_index_sort(raw_rating_path, cache_path)
  user_map, item_map = raw_data["user_map"], raw_data["item_map"]
  num_users, num_items = DATASET_TO_NUM_USERS_AND_ITEMS[dataset]

  if num_users != len(user_map):
# NOTE(review): the next line is a diff hunk header that leaked into the
# source; the body of the `num_users` check above was elided along with it.
......@@ -367,387 +205,28 @@ def construct_cache(dataset, data_dir, num_data_readers, match_mlperf,
    raise ValueError("Expected to find {} items, but found {}".format(
        num_items, len(item_map)))

  generate_train_eval_data(df=df, approx_num_shards=approx_num_shards,
                           num_items=len(item_map), cache_paths=cache_paths,
                           match_mlperf=match_mlperf)
  del approx_num_shards  # value may have changed.

  ncf_dataset = NCFDataset(user_map=user_map, item_map=item_map,
                           num_data_readers=num_data_readers,
                           cache_paths=cache_paths,
                           num_train_positives=len(df) - len(user_map),
                           deterministic=deterministic)

  producer = data_pipeline.get_constructor(constructor_type or "materialized")(
      maximum_number_epochs=params["train_epochs"],
      num_users=num_users,
      num_items=num_items,
      user_map=user_map,
      item_map=item_map,
      train_pos_users=raw_data[rconst.TRAIN_USER_KEY],
      train_pos_items=raw_data[rconst.TRAIN_ITEM_KEY],
      train_batch_size=params["batch_size"],
      batches_per_train_step=params["batches_per_step"],
      num_train_negatives=params["num_neg"],
      eval_pos_users=raw_data[rconst.EVAL_USER_KEY],
      eval_pos_items=raw_data[rconst.EVAL_ITEM_KEY],
      eval_batch_size=params["eval_batch_size"],
      batches_per_eval_step=params["batches_per_step"],
      stream_files=params["use_tpu"],
      deterministic=deterministic
  )

  run_time = timeit.default_timer() - st
  # NOTE(review): the next two logging calls are merged; the first string has
  # no closing paren, so `.format(run_time)` binds to the second string and
  # the statement as written is not syntactically valid.
  tf.logging.info("Cache construction complete. Time: {:.1f} sec."
  tf.logging.info("Data preprocessing complete. Time: {:.1f} sec."
                  .format(run_time))

  return ncf_dataset
def _shutdown(proc):
  # type: (subprocess.Popen) -> None
  """Convenience function to cleanly shut down async generation process."""
  tf.logging.info("Shutting down train data creation subprocess.")
  try:
    try:
      # First attempt a graceful exit: signal the subprocess, then give it
      # five seconds to wind down on its own.
      proc.send_signal(signal.SIGINT)
      time.sleep(5)
      if proc.poll() is not None:
        # SIGINT was handled successfully within 5 seconds.
        tf.logging.info("Train data creation subprocess ended")
        return
    except socket.error:
      pass  # Fall through to the forceful shutdown path below.

    # Otherwise another second of grace period and then force kill the process.
    time.sleep(1)
    proc.terminate()
    tf.logging.info("Train data creation subprocess killed")
  except:  # pylint: disable=broad-except
    tf.logging.error("Data generation subprocess could not be killed.")
def write_flagfile(flags_, ncf_dataset):
  """Write flagfile to begin async data generation."""
  if ncf_dataset.deterministic:
    flags_["seed"] = stat_utils.random_int32()

  # We write to a temp file then atomically rename it to the final file,
  # because writing directly to the final file can cause the data generation
  # async process to read a partially written JSON file.
  cache_root = ncf_dataset.cache_paths.cache_root
  temp_path = os.path.join(cache_root, rconst.FLAGFILE_TEMP)
  tf.logging.info("Preparing flagfile for async data generation in {} ..."
                  .format(temp_path))
  with tf.gfile.Open(temp_path, "w") as f:
    f.write("".join(
        "--{}={}\n".format(key, value)
        for key, value in six.iteritems(flags_)))
  final_path = os.path.join(cache_root, rconst.FLAGFILE)
  tf.gfile.Rename(temp_path, final_path)
  tf.logging.info(
      "Wrote flagfile for async data generation in {}.".format(final_path))
def instantiate_pipeline(dataset, data_dir, batch_size, eval_batch_size,
                         num_cycles, num_data_readers=None, num_neg=4,
                         epochs_per_cycle=1, match_mlperf=False,
                         deterministic=False, use_subprocess=True,
                         cache_id=None):
  # type: (...) -> (NCFDataset, typing.Callable)
  """Preprocess data and start negative generation subprocess.

  Builds the on-disk cache via construct_cache(), optionally spawns the async
  data generation subprocess, registers an atexit cleanup hook, and writes the
  flagfile that tells the generation process what to produce.

  Returns:
    A tuple of (ncf_dataset, cleanup) where `cleanup` is the idempotent
    shutdown/removal callable registered with atexit.
  """
  tf.logging.info("Beginning data preprocessing.")
  tf.gfile.MakeDirs(data_dir)
  ncf_dataset = construct_cache(dataset=dataset, data_dir=data_dir,
                                num_data_readers=num_data_readers,
                                match_mlperf=match_mlperf,
                                deterministic=deterministic,
                                cache_id=cache_id)
  # By limiting the number of workers we guarantee that the worker
  # pool underlying the training generation doesn't starve other processes.
  num_workers = int(multiprocessing.cpu_count() * 0.75) or 1

  # These flags are serialized to the flagfile consumed by the async
  # generation process (see write_flagfile()).
  flags_ = {
      "data_dir": data_dir,
      "cache_id": ncf_dataset.cache_paths.cache_id,
      "num_neg": num_neg,
      "num_train_positives": ncf_dataset.num_train_positives,
      "num_items": ncf_dataset.num_items,
      "num_users": ncf_dataset.num_users,
      "num_readers": ncf_dataset.num_data_readers,
      "epochs_per_cycle": epochs_per_cycle,
      "num_cycles": num_cycles,
      "train_batch_size": batch_size,
      "eval_batch_size": eval_batch_size,
      "num_workers": num_workers,
      "redirect_logs": use_subprocess,
      "use_tf_logging": not use_subprocess,
      "ml_perf": match_mlperf,
      "output_ml_perf_compliance_logging": mlperf_helper.LOGGER.enabled,
  }

  if use_subprocess:
    tf.logging.info("Creating training file subprocess.")
    subproc_env = os.environ.copy()
    # The subprocess uses TensorFlow for tf.gfile, but it does not need GPU
    # resources and by default will try to allocate GPU memory. This would cause
    # contention with the main training process.
    subproc_env["CUDA_VISIBLE_DEVICES"] = ""
    subproc_args = popen_helper.INVOCATION + [
        "--data_dir", data_dir,
        "--cache_id", str(ncf_dataset.cache_paths.cache_id)]
    tf.logging.info(
        "Generation subprocess command: {}".format(" ".join(subproc_args)))
    proc = subprocess.Popen(args=subproc_args, shell=False, env=subproc_env)

  # Mutable flag shared with the closure so the atexit hook is idempotent.
  cleanup_called = {"finished": False}
  @atexit.register
  def cleanup():
    """Remove files and subprocess from data generation."""
    if cleanup_called["finished"]:
      return

    if use_subprocess:
      _shutdown(proc)

    try:
      tf.gfile.DeleteRecursively(ncf_dataset.cache_paths.cache_root)
    except tf.errors.NotFoundError:
      pass

    cleanup_called["finished"] = True

  # Wait up to 300 seconds for the generation process to write its `alive`
  # marker file.
  # NOTE(review): this wait runs even when use_subprocess is False — in that
  # case an external process is presumably expected to write the marker;
  # otherwise this blocks for 5 minutes and then raises. Confirm intent.
  for _ in range(300):
    if tf.gfile.Exists(ncf_dataset.cache_paths.subproc_alive):
      break
    time.sleep(1)  # allow `alive` file to be written
  if not tf.gfile.Exists(ncf_dataset.cache_paths.subproc_alive):
    raise ValueError("Generation subprocess did not start correctly. Data will "
                     "not be available; exiting to avoid waiting forever.")

  # We start the async process and wait for it to signal that it is alive. It
  # will then enter a loop waiting for the flagfile to be written. Once we see
  # that the async process has signaled that it is alive, we clear the system
  # caches and begin the run.
  mlperf_helper.ncf_print(key=mlperf_helper.TAGS.RUN_CLEAR_CACHES)
  mlperf_helper.clear_system_caches()
  mlperf_helper.ncf_print(key=mlperf_helper.TAGS.RUN_START)
  write_flagfile(flags_, ncf_dataset)

  return ncf_dataset, cleanup
def make_deserialize(params, batch_size, training=False):
  """Construct deserialize function for training and eval fns."""
  # Every column is stored as a packed byte string holding a whole batch.
  feature_map = {
      movielens.USER_COLUMN: tf.FixedLenFeature([], dtype=tf.string),
      movielens.ITEM_COLUMN: tf.FixedLenFeature([], dtype=tf.string),
  }
  extra_key = "labels" if training else rconst.DUPLICATE_MASK
  feature_map[extra_key] = tf.FixedLenFeature([], dtype=tf.string)

  def deserialize(examples_serialized):
    """Called by Dataset.map() to convert batches of records to tensors."""
    features = tf.parse_single_example(examples_serialized, feature_map)
    users = tf.reshape(tf.decode_raw(
        features[movielens.USER_COLUMN], tf.int32), (batch_size,))
    items = tf.reshape(tf.decode_raw(
        features[movielens.ITEM_COLUMN], tf.uint16), (batch_size,))

    if params["use_tpu"] or params["use_xla_for_gpu"]:
      items = tf.cast(items, tf.int32)  # TPU and XLA disallows uint16 infeed.

    if training:
      # Labels are packed int8 booleans alongside the features.
      labels = tf.reshape(tf.cast(tf.decode_raw(
          features["labels"], tf.int8), tf.bool), (batch_size,))
      return {
          movielens.USER_COLUMN: users,
          movielens.ITEM_COLUMN: items,
      }, labels

    # Eval records carry a duplicate mask instead of labels.
    dupe_mask = tf.reshape(tf.cast(tf.decode_raw(
        features[rconst.DUPLICATE_MASK], tf.int8), tf.bool), (batch_size,))
    return {
        movielens.USER_COLUMN: users,
        movielens.ITEM_COLUMN: items,
        rconst.DUPLICATE_MASK: dupe_mask,
    }

  return deserialize
def hash_pipeline(dataset, deterministic):
  # type: (tf.data.Dataset, bool) -> None
  """Utility function for detecting non-determinism in the data pipeline.

  Args:
    dataset: a tf.data.Dataset generated by the input_fn
    deterministic: Does the input_fn expect the dataset to be deterministic.
      (i.e. fixed seed, sloppy=False, etc.)
  """
  if not deterministic:
    tf.logging.warning("Data pipeline is not marked as deterministic. Hash "
                       "values are not expected to be meaningful.")

  next_batch = dataset.make_one_shot_iterator().get_next()
  digest = hashlib.md5()
  batch_count = 0
  first_batch_hash = b""
  with tf.Session() as sess:
    while True:
      try:
        result = sess.run(next_batch)
        if isinstance(result, tuple):
          result = result[0]  # only hash features
      except tf.errors.OutOfRangeError:
        break  # Pipeline exhausted.
      batch_count += 1
      # Fold the user and item columns into a running digest.
      for column in (movielens.USER_COLUMN, movielens.ITEM_COLUMN):
        digest.update(memoryview(result[column]).tobytes())
      if batch_count == 1:
        first_batch_hash = digest.hexdigest()
  overall_hash = digest.hexdigest()
  tf.logging.info("Batch count: {}".format(batch_count))
  tf.logging.info(" [pipeline_hash] First batch hash: {}".format(
      first_batch_hash))
  tf.logging.info(" [pipeline_hash] All batches hash: {}".format(overall_hash))
def make_input_fn(
    ncf_dataset,  # type: typing.Optional[NCFDataset]
    is_training,  # type: bool
    record_files=None  # type: typing.Optional[tf.Tensor]
):
  # type: (...) -> (typing.Callable, str, int)
  """Construct training input_fn for the current epoch.

  Args:
    ncf_dataset: Dataset metadata; if None, a synthetic input_fn is returned.
    is_training: Whether to build the train or eval pipeline.
    record_files: Optional explicit record file pattern. If provided, epoch
      metadata is not consulted (and the returned record_dir/batch_count are
      None).

  Returns:
    A tuple of (input_fn, record_dir, batch_count).
  """

  if ncf_dataset is None:
    return make_synthetic_input_fn(is_training)

  if record_files is not None:
    # Caller supplied the files directly; no epoch bookkeeping is available.
    epoch_metadata = None
    batch_count = None
    record_dir = None
  else:
    # Block until the async generation process has produced this epoch.
    epoch_metadata, record_dir, template = get_epoch_info(is_training,
                                                          ncf_dataset)
    record_files = os.path.join(record_dir, template.format("*"))
    # This value is used to check that the batch count from the subprocess
    # matches the batch count expected by the main thread.
    batch_count = epoch_metadata["batch_count"]

  def input_fn(params):
    """Generated input_fn for the given epoch."""
    if is_training:
      batch_size = params["batch_size"]
    else:
      # Estimator has "eval_batch_size" included in the params, but TPUEstimator
      # populates "batch_size" to the appropriate value.
      batch_size = params.get("eval_batch_size") or params["batch_size"]

    # Records pack a fixed batch per example, so a batch size mismatch would
    # break tf.parse_single_example downstream — fail fast here instead.
    if epoch_metadata and epoch_metadata["batch_size"] != batch_size:
      raise ValueError(
          "Records were constructed with batch size {}, but input_fn was given "
          "a batch size of {}. This will result in a deserialization error in "
          "tf.parse_single_example."
          .format(epoch_metadata["batch_size"], batch_size))

    record_files_ds = tf.data.Dataset.list_files(record_files, shuffle=False)

    # sloppy=True trades determinism for throughput when allowed.
    interleave = tf.data.experimental.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,
        block_length=100000,
        sloppy=not ncf_dataset.deterministic,
        prefetch_input_elements=4,
    )

    deserialize = make_deserialize(params, batch_size, is_training)
    dataset = record_files_ds.apply(interleave)
    dataset = dataset.map(deserialize, num_parallel_calls=4)
    dataset = dataset.prefetch(32)

    if params.get("hash_pipeline"):
      # Debug hook: hashes every batch to detect pipeline non-determinism.
      hash_pipeline(dataset, ncf_dataset.deterministic)

    return dataset

  return input_fn, record_dir, batch_count
def _check_subprocess_alive(ncf_dataset, directory):
  """Raise if the async generation subprocess appears to have died."""
  if (tf.gfile.Exists(ncf_dataset.cache_paths.subproc_alive) or
      tf.gfile.Exists(directory)):
    return  # Either the liveness marker or the expected output exists.

  # The generation subprocess must have been alive at some point, because we
  # earlier checked that the subproc_alive file existed.
  raise ValueError("Generation subprocess unexpectedly died. Data will not "
                   "be available; exiting to avoid waiting forever.")
def get_epoch_info(is_training, ncf_dataset):
  """Wait for the epoch input data to be ready and return various info about it.

  Args:
    is_training: If we should return info for a training or eval epoch.
    ncf_dataset: An NCFDataset.

  Returns:
    epoch_metadata: A dict with epoch metadata.
    record_dir: The directory with the TFRecord files storing the input data.
    template: A string template of the files in `record_dir`.
      `template.format('*')` is a glob that matches all the record files.
  """
  if is_training:
    epoch_root = ncf_dataset.cache_paths.train_epoch_dir
    _check_subprocess_alive(ncf_dataset, epoch_root)

    # Block until the generation process creates the epoch root directory.
    while not tf.gfile.Exists(epoch_root):
      tf.logging.info("Waiting for {} to exist.".format(epoch_root))
      time.sleep(1)

    # Block until at least one epoch folder has been written.
    epoch_dir_names = tf.gfile.ListDirectory(epoch_root)
    while not epoch_dir_names:
      tf.logging.info("Waiting for data folder to be created.")
      time.sleep(1)
      epoch_dir_names = tf.gfile.ListDirectory(epoch_root)

    # Folder names are zfilled, so the lexicographic minimum is also the
    # numeric minimum (the oldest epoch).
    record_dir = os.path.join(epoch_root, min(epoch_dir_names))
    template = rconst.TRAIN_RECORD_TEMPLATE
  else:
    record_dir = ncf_dataset.cache_paths.eval_data_subdir
    _check_subprocess_alive(ncf_dataset, record_dir)
    template = rconst.EVAL_RECORD_TEMPLATE

  # The ready file is written last; its presence means the records and their
  # metadata are complete.
  ready_file = os.path.join(record_dir, rconst.READY_FILE)
  while not tf.gfile.Exists(ready_file):
    tf.logging.info("Waiting for records in {} to be ready".format(record_dir))
    time.sleep(1)

  with tf.gfile.Open(ready_file, "r") as f:
    epoch_metadata = json.load(f)
  return epoch_metadata, record_dir, template
def make_synthetic_input_fn(is_training):
  """Construct training input_fn that uses synthetic data."""
  def input_fn(params):
    """Generated input_fn for the given epoch."""
    if is_training:
      batch_size = params["batch_size"]
    else:
      batch_size = params["eval_batch_size"] or params["batch_size"]

    def random_int32(maxval):
      # Uniform random int32 ids in [0, maxval).
      return tf.random_uniform([batch_size], dtype=tf.int32, minval=0,
                               maxval=maxval)

    users = random_int32(params["num_users"])
    items = random_int32(params["num_items"])

    if is_training:
      labels = random_int32(2)
      data = ({
          movielens.USER_COLUMN: users,
          movielens.ITEM_COLUMN: items,
      }, labels)
    else:
      dupe_mask = tf.cast(random_int32(2), tf.bool)
      data = {
          movielens.USER_COLUMN: users,
          movielens.ITEM_COLUMN: items,
          rconst.DUPLICATE_MASK: dupe_mask,
      }

    dataset = tf.data.Dataset.from_tensors(data).repeat(
        SYNTHETIC_BATCHES_PER_EPOCH)
    return dataset.prefetch(32)

  return input_fn, None, SYNTHETIC_BATCHES_PER_EPOCH
print(producer)
return num_users, num_items, producer
......@@ -18,19 +18,19 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from collections import defaultdict
import hashlib
import os
import pickle
import time
import mock
import numpy as np
import pandas as pd
import scipy.stats
import tensorflow as tf
from official.datasets import movielens
from official.recommendation import constants as rconst
from official.recommendation import data_async_generation
from official.recommendation import data_preprocessing
from official.recommendation import stat_utils
from official.recommendation import popen_helper
DATASET = "ml-test"
......@@ -42,10 +42,18 @@ EVAL_BATCH_SIZE = 4000
NUM_NEG = 4
END_TO_END_TRAIN_MD5 = "b218738e915e825d03939c5e305a2698"
END_TO_END_EVAL_MD5 = "d753d0f3186831466d6e218163a9501e"
FRESH_RANDOMNESS_MD5 = "63d0dff73c0e5f1048fbdc8c65021e22"
def mock_download(*args, **kwargs):
  """No-op stand-in for the dataset download function used in tests."""
  del args, kwargs  # Unused; accepted so any call signature is tolerated.
  return None
# The forkpool used by data producers interacts badly with the threading
# used by TestCase. Without this patch tests will hang, and no amount
# of diligent closing and joining within the producer will prevent it.
@mock.patch.object(popen_helper, "get_forkpool", popen_helper.get_fauxpool)
class BaseTest(tf.test.TestCase):
def setUp(self):
self.temp_data_dir = self.get_temp_dir()
......@@ -65,10 +73,10 @@ class BaseTest(tf.test.TestCase):
scores = np.random.randint(low=0, high=5, size=NUM_PTS)
times = np.random.randint(low=1000000000, high=1200000000, size=NUM_PTS)
rating_file = os.path.join(ratings_folder, movielens.RATINGS_FILE)
self.rating_file = os.path.join(ratings_folder, movielens.RATINGS_FILE)
self.seen_pairs = set()
self.holdout = {}
with tf.gfile.Open(rating_file, "w") as f:
with tf.gfile.Open(self.rating_file, "w") as f:
f.write("user_id,item_id,rating,timestamp\n")
for usr, itm, scr, ts in zip(users, items, scores, times):
pair = (usr, itm)
......@@ -85,21 +93,29 @@ class BaseTest(tf.test.TestCase):
data_preprocessing.DATASET_TO_NUM_USERS_AND_ITEMS[DATASET] = (NUM_USERS,
NUM_ITEMS)
def make_params(self, train_epochs=1):
return {
"train_epochs": train_epochs,
"batches_per_step": 1,
"use_seed": False,
"batch_size": BATCH_SIZE,
"eval_batch_size": EVAL_BATCH_SIZE,
"num_neg": NUM_NEG,
"match_mlperf": True,
"use_tpu": False,
"use_xla_for_gpu": False,
}
  def test_preprocessing(self):
    # For the most part the necessary checks are performed within
    # construct_cache()

    # NOTE(review): the first two stanzas call construct_cache() with the old
    # (NCFDataset-returning) signature, while the final stanza exercises the
    # new _filter_index_sort(cache_path=...) API. Two revisions of this test
    # appear merged together; confirm which API the module currently exposes.
    ncf_dataset = data_preprocessing.construct_cache(
        dataset=DATASET, data_dir=self.temp_data_dir, num_data_readers=2,
        match_mlperf=False, deterministic=False)
    assert ncf_dataset.num_users == NUM_USERS
    assert ncf_dataset.num_items == NUM_ITEMS

    time.sleep(1)  # Ensure we create the next cache in a new directory.
    ncf_dataset = data_preprocessing.construct_cache(
        dataset=DATASET, data_dir=self.temp_data_dir, num_data_readers=2,
        match_mlperf=True, deterministic=False)
    assert ncf_dataset.num_users == NUM_USERS
    assert ncf_dataset.num_items == NUM_ITEMS

    # _filter_index_sort()
    cache_path = os.path.join(self.temp_data_dir, "test_cache.pickle")
    data, valid_cache = data_preprocessing._filter_index_sort(
        self.rating_file, cache_path=cache_path)

    assert len(data[rconst.USER_MAP]) == NUM_USERS
    assert len(data[rconst.ITEM_MAP]) == NUM_ITEMS
def drain_dataset(self, dataset, g):
# type: (tf.data.Dataset, tf.Graph) -> list
......@@ -114,29 +130,46 @@ class BaseTest(tf.test.TestCase):
break
return output
def test_end_to_end(self):
ncf_dataset, _ = data_preprocessing.instantiate_pipeline(
dataset=DATASET, data_dir=self.temp_data_dir,
batch_size=BATCH_SIZE, eval_batch_size=EVAL_BATCH_SIZE,
num_cycles=1, num_data_readers=2, num_neg=NUM_NEG)
def _test_end_to_end(self, constructor_type):
params = self.make_params(train_epochs=1)
_, _, producer = data_preprocessing.instantiate_pipeline(
dataset=DATASET, data_dir=self.temp_data_dir, params=params,
constructor_type=constructor_type, deterministic=True)
producer.start()
producer.join()
assert producer._fatal_exception is None
user_inv_map = {v: k for k, v in producer.user_map.items()}
item_inv_map = {v: k for k, v in producer.item_map.items()}
# ==========================================================================
# == Training Data =========================================================
# ==========================================================================
g = tf.Graph()
with g.as_default():
input_fn, record_dir, batch_count = \
data_preprocessing.make_input_fn(ncf_dataset, True)
dataset = input_fn({"batch_size": BATCH_SIZE, "use_tpu": False,
"use_xla_for_gpu": False})
input_fn = producer.make_input_fn(is_training=True)
dataset = input_fn(params)
first_epoch = self.drain_dataset(dataset=dataset, g=g)
user_inv_map = {v: k for k, v in ncf_dataset.user_map.items()}
item_inv_map = {v: k for k, v in ncf_dataset.item_map.items()}
counts = defaultdict(int)
train_examples = {
True: set(),
False: set(),
}
md5 = hashlib.md5()
for features, labels in first_epoch:
for u, i, l in zip(features[movielens.USER_COLUMN],
features[movielens.ITEM_COLUMN], labels):
data_list = [
features[movielens.USER_COLUMN], features[movielens.ITEM_COLUMN],
features[rconst.VALID_POINT_MASK], labels]
for i in data_list:
md5.update(i.tobytes())
for u, i, v, l in zip(*data_list):
if not v:
continue # ignore padding
u_raw = user_inv_map[u]
i_raw = item_inv_map[i]
......@@ -145,61 +178,166 @@ class BaseTest(tf.test.TestCase):
# generation, so it will occasionally appear as a negative example
# during training.
assert not l
assert i_raw == self.holdout[u_raw][1]
self.assertEqual(i_raw, self.holdout[u_raw][1])
train_examples[l].add((u_raw, i_raw))
num_positives_seen = len(train_examples[True])
counts[(u_raw, i_raw)] += 1
self.assertRegexpMatches(md5.hexdigest(), END_TO_END_TRAIN_MD5)
assert ncf_dataset.num_train_positives == num_positives_seen
num_positives_seen = len(train_examples[True])
self.assertEqual(producer._train_pos_users.shape[0], num_positives_seen)
# This check is more heuristic because negatives are sampled with
# replacement. It only checks that negative generation is reasonably random.
assert len(train_examples[False]) / NUM_NEG / num_positives_seen > 0.9
def test_shard_randomness(self):
users = [0, 0, 0, 0, 1, 1, 1, 1]
items = [0, 2, 4, 6, 0, 2, 4, 6]
times = [1, 2, 3, 4, 1, 2, 3, 4]
df = pd.DataFrame({movielens.USER_COLUMN: users,
movielens.ITEM_COLUMN: items,
movielens.TIMESTAMP_COLUMN: times})
cache_paths = rconst.Paths(data_dir=self.temp_data_dir)
np.random.seed(1)
num_shards = 2
num_items = 10
data_preprocessing.generate_train_eval_data(
df, approx_num_shards=num_shards, num_items=num_items,
cache_paths=cache_paths, match_mlperf=True)
raw_shards = tf.gfile.ListDirectory(cache_paths.train_shard_subdir)
assert len(raw_shards) == num_shards
sharded_eval_data = []
for i in range(2):
sharded_eval_data.append(data_async_generation._process_shard(
(os.path.join(cache_paths.train_shard_subdir, raw_shards[i]),
num_items, rconst.NUM_EVAL_NEGATIVES, stat_utils.random_int32(),
False, True)))
if sharded_eval_data[0][0][0] == 1:
# Order is not assured for this part of the pipeline.
sharded_eval_data.reverse()
eval_data = [np.concatenate([shard[i] for shard in sharded_eval_data])
for i in range(3)]
eval_data = {
movielens.USER_COLUMN: eval_data[0],
movielens.ITEM_COLUMN: eval_data[1],
}
self.assertGreater(
len(train_examples[False]) / NUM_NEG / num_positives_seen, 0.9)
# This checks that the samples produced are independent by checking the
# number of duplicate entries. If workers are not properly independent there
# will be lots of repeated pairs.
self.assertLess(np.mean(list(counts.values())), 1.1)
# ==========================================================================
# == Eval Data =============================================================
# ==========================================================================
with g.as_default():
input_fn = producer.make_input_fn(is_training=False)
dataset = input_fn(params)
eval_data = self.drain_dataset(dataset=dataset, g=g)
eval_items_per_user = rconst.NUM_EVAL_NEGATIVES + 1
self.assertAllClose(eval_data[movielens.USER_COLUMN],
[0] * eval_items_per_user + [1] * eval_items_per_user)
current_user = None
md5 = hashlib.md5()
for features in eval_data:
data_list = [
features[movielens.USER_COLUMN], features[movielens.ITEM_COLUMN],
features[rconst.DUPLICATE_MASK]]
for i in data_list:
md5.update(i.tobytes())
# Each shard process should generate different random items.
self.assertNotAllClose(
eval_data[movielens.ITEM_COLUMN][:eval_items_per_user],
eval_data[movielens.ITEM_COLUMN][eval_items_per_user:])
for idx, (u, i, d) in enumerate(zip(*data_list)):
u_raw = user_inv_map[u]
i_raw = item_inv_map[i]
if current_user is None:
current_user = u
# Ensure that users appear in blocks, as the evaluation logic expects
# this structure.
self.assertEqual(u, current_user)
# The structure of evaluation data is 999 negative examples followed
# by the holdout positive.
if not (idx + 1) % (rconst.NUM_EVAL_NEGATIVES + 1):
# Check that the last element in each chunk is the holdout item.
self.assertEqual(i_raw, self.holdout[u_raw][1])
current_user = None
elif i_raw == self.holdout[u_raw][1]:
# Because the holdout item is not given to the negative generation
# process, it can appear as a negative. In that case, it should be
# masked out as a duplicate. (Since the true positive is placed at
# the end and would therefore lose the tie.)
assert d
else:
# Otherwise check that the other 999 points for a user are selected
# from the negatives.
assert (u_raw, i_raw) not in self.seen_pairs
self.assertRegexpMatches(md5.hexdigest(), END_TO_END_EVAL_MD5)
  def _test_fresh_randomness(self, constructor_type):
    """Check that negatives are freshly sampled for every training epoch.

    Drains several epochs from the producer and verifies that (1) the exact
    byte stream of produced batches matches a known md5, (2) each positive
    example appears exactly once per epoch, and (3) per-pair repeat counts of
    sampled negatives approximately follow the binomial distribution expected
    of independent sampling.

    Args:
      constructor_type: String passed to instantiate_pipeline selecting the
        negative-generation strategy (e.g. "bisection" or "materialized").
    """
    train_epochs = 5
    params = self.make_params(train_epochs=train_epochs)
    _, _, producer = data_preprocessing.instantiate_pipeline(
        dataset=DATASET, data_dir=self.temp_data_dir, params=params,
        constructor_type=constructor_type, deterministic=True)

    producer.start()

    results = []
    g = tf.Graph()
    with g.as_default():
      for _ in range(train_epochs):
        # A fresh input_fn per epoch; each epoch's batches are accumulated.
        input_fn = producer.make_input_fn(is_training=True)
        dataset = input_fn(params)
        results.extend(self.drain_dataset(dataset=dataset, g=g))

    producer.join()
    assert producer._fatal_exception is None

    positive_counts, negative_counts = defaultdict(int), defaultdict(int)
    md5 = hashlib.md5()
    for features, labels in results:
      data_list = [
          features[movielens.USER_COLUMN], features[movielens.ITEM_COLUMN],
          features[rconst.VALID_POINT_MASK], labels]
      for i in data_list:
        md5.update(i.tobytes())

      for u, i, v, l in zip(*data_list):
        if not v:
          continue  # ignore padding

        if l:
          positive_counts[(u, i)] += 1
        else:
          negative_counts[(u, i)] += 1

    # Deterministic production: the hash of all batches must be reproducible.
    self.assertRegexpMatches(md5.hexdigest(), FRESH_RANDOMNESS_MD5)

    # The positive examples should appear exactly once each epoch
    self.assertAllEqual(list(positive_counts.values()),
                        [train_epochs for _ in positive_counts])

    # The threshold for the negatives is heuristic: in general repeats are
    # expected, but they should not appear too frequently.

    pair_cardinality = NUM_USERS * NUM_ITEMS
    neg_pair_cardinality = pair_cardinality - len(self.seen_pairs)

    # Approximation for the expected number of times that a particular
    # negative will appear in a given epoch. Implicit in this calculation is
    # the treatment of all negative pairs as equally likely. This is not
    # necessarily reasonable in general; however the generation in
    # self.setUp() will approximate this behavior sufficiently for heuristic
    # testing.
    e_sample = len(self.seen_pairs) * NUM_NEG / neg_pair_cardinality

    # The frequency of occurrence of a given negative pair should follow an
    # approximately binomial distribution in the limit that the cardinality of
    # the negative pair set >> number of samples per epoch.
    approx_pdf = scipy.stats.binom.pmf(k=np.arange(train_epochs+1),
                                       n=train_epochs, p=e_sample)

    # Tally the actual observed counts.
    count_distribution = [0 for _ in range(train_epochs + 1)]
    for i in negative_counts.values():
      i = min([i, train_epochs])  # round down tail for simplicity.
      count_distribution[i] += 1
    # Pairs never observed make up the remainder of the zero-count bucket.
    count_distribution[0] = neg_pair_cardinality - sum(count_distribution[1:])

    # Check that the frequency of negative pairs is approximately binomial.
    for i in range(train_epochs + 1):
      if approx_pdf[i] < 0.05:
        continue  # Variance will be high at the tails.

      # Symmetric relative deviation between observed and predicted mass.
      observed_fraction = count_distribution[i] / neg_pair_cardinality
      deviation = (2 * abs(observed_fraction - approx_pdf[i]) /
                   (observed_fraction + approx_pdf[i]))

      self.assertLess(deviation, 0.2)
  def test_end_to_end_materialized(self):
    """Run the end-to-end check with the materialized negative constructor."""
    self._test_end_to_end("materialized")
  def test_end_to_end_bisection(self):
    """Run the end-to-end check with the bisection negative constructor."""
    self._test_end_to_end("bisection")
  def test_fresh_randomness_materialized(self):
    """Run the fresh-randomness check with the materialized constructor."""
    self._test_fresh_randomness("materialized")
  def test_fresh_randomness_bisection(self):
    """Run the fresh-randomness check with the bisection constructor."""
    self._test_fresh_randomness("bisection")
if __name__ == "__main__":
......
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Contains NcfModelRunner, which can train and evaluate an NCF model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from collections import namedtuple
import os
import time
import tensorflow as tf
from tensorflow.contrib.compiler import xla
from official.recommendation import constants as rconst
from official.recommendation import data_preprocessing
from official.recommendation import neumf_model
class NcfModelRunner(object):
  """Creates a graph to train/evaluate an NCF model, and runs it.

  This class builds both a training model and evaluation model in the graph.
  The two models share variables, so that during evaluation, the trained
  variables are used.
  """

  # _TrainModelProperties and _EvalModelProperties store useful properties of
  # the training and evaluation models, respectively.
  # _SHARED_MODEL_PROPERTY_FIELDS is their shared fields.
  _SHARED_MODEL_PROPERTY_FIELDS = (
      # A scalar tf.string placeholder tensor, that will be fed the path to the
      # directory storing the TFRecord files for the input data.
      "record_files_placeholder",
      # The tf.data.Iterator to iterate over the input data.
      "iterator",
      # A scalar float tensor representing the model loss.
      "loss",
      # The batch size, as a Python int.
      "batch_size",
      # The op to run the model. For the training model, this trains the model
      # for one step. For the evaluation model, this computes the metrics and
      # updates the metric variables.
      "run_model_op")
  _TrainModelProperties = namedtuple("_TrainModelProperties",  # pylint: disable=invalid-name
                                     _SHARED_MODEL_PROPERTY_FIELDS)
  _EvalModelProperties = namedtuple(  # pylint: disable=invalid-name
      "_EvalModelProperties", _SHARED_MODEL_PROPERTY_FIELDS + (
          # A dict from metric name to metric tensor.
          "metrics",
          # Initializes the metric variables.
          "metric_initializer",))

  def __init__(self, ncf_dataset, params, num_train_steps, num_eval_steps,
               use_while_loop):
    """Builds the train and eval graphs and starts a session.

    Args:
      ncf_dataset: Dataset metadata object used to construct the input
        pipeline, or None (e.g. when synthetic data is used).
      params: A dict of hyperparameters passed through to the model_fn.
      num_train_steps: Number of training steps per cycle.
      num_eval_steps: Number of evaluation steps per cycle.
      use_while_loop: If True, run an entire cycle inside a single
        session.run() call via tf.while_loop.
    """
    self._num_train_steps = num_train_steps
    self._num_eval_steps = num_eval_steps
    self._use_while_loop = use_while_loop
    with tf.Graph().as_default() as self._graph:
      if params["use_xla_for_gpu"]:
        # The XLA functions we use require resource variables.
        tf.enable_resource_variables()
      self._ncf_dataset = ncf_dataset
      self._global_step = tf.train.create_global_step()
      self._train_model_properties = self._build_model(params, num_train_steps,
                                                       is_training=True)
      self._eval_model_properties = self._build_model(params, num_eval_steps,
                                                      is_training=False)

      initializer = tf.global_variables_initializer()
    # Finalize before creating the session so any accidental graph mutation
    # after this point raises instead of silently growing the graph.
    self._graph.finalize()
    self._session = tf.Session(graph=self._graph)
    self._session.run(initializer)

  def _compute_metric_mean(self, metric_name):
    """Computes the mean from a call tf tf.metrics.mean().

    tf.metrics.mean() already returns the mean, so normally this call is
    unnecessary. But, if tf.metrics.mean() is called inside a tf.while_loop, the
    mean cannot be accessed outside the while loop. Calling this function
    recomputes the mean from the variables created by tf.metrics.mean(),
    allowing the mean to be accessed outside the while loop.

    Args:
      metric_name: The string passed to the 'name' argument of tf.metrics.mean()

    Returns:
      The mean of the metric.
    """
    metric_vars = tf.get_collection(tf.GraphKeys.METRIC_VARIABLES)
    total_suffix = metric_name + "/total:0"
    total_vars = [v for v in metric_vars if v.name.endswith(total_suffix)]
    assert len(total_vars) == 1, (
        "Found {} metric variables ending with '{}' but expected to find "
        "exactly 1. All metric variables: {}".format(
            len(total_vars), total_suffix, metric_vars))
    total_var = total_vars[0]

    count_suffix = metric_name + "/count:0"
    count_vars = [v for v in metric_vars if v.name.endswith(count_suffix)]
    assert len(count_vars) == 1, (
        "Found {} metric variables ending with '{}' but expected to find "
        "exactly 1. All metric variables: {}".format(
            len(count_vars), count_suffix, metric_vars))
    count_var = count_vars[0]
    return total_var / count_var

  def _build_model(self, params, num_steps, is_training):
    """Builds the NCF model.

    Args:
      params: A dict of hyperparameters.
      num_steps: Number of steps the model will be run for; used when the
        graph wraps the steps in a tf.while_loop.
      is_training: If True, build the training model. If False, build the
        evaluation model.
    Returns:
      A _TrainModelProperties if is_training is True, or an _EvalModelProperties
      otherwise.
    """
    record_files_placeholder = tf.placeholder(tf.string, ())
    input_fn, _, _ = \
      data_preprocessing.make_input_fn(
          ncf_dataset=self._ncf_dataset, is_training=is_training,
          record_files=record_files_placeholder)
    dataset = input_fn(params)
    iterator = dataset.make_initializable_iterator()

    model_fn = neumf_model.neumf_model_fn
    if params["use_xla_for_gpu"]:
      model_fn = xla.estimator_model_fn(model_fn)

    if is_training:
      return self._build_train_specific_graph(
          iterator, model_fn, params, record_files_placeholder, num_steps)
    else:
      return self._build_eval_specific_graph(
          iterator, model_fn, params, record_files_placeholder, num_steps)

  def _build_train_specific_graph(self, iterator, model_fn, params,
                                  record_files_placeholder, num_train_steps):
    """Builds the part of the model that is specific to training."""

    def build():
      features, labels = iterator.get_next()
      estimator_spec = model_fn(
          features, labels, tf.estimator.ModeKeys.TRAIN, params)
      with tf.control_dependencies([estimator_spec.train_op]):
        run_model_op = self._global_step.assign_add(1)
      return run_model_op, estimator_spec.loss

    if self._use_while_loop:
      def body(i):
        run_model_op_single_step, _ = build()
        with tf.control_dependencies([run_model_op_single_step]):
          return i + 1

      # parallel_iterations=1 keeps the steps strictly sequential.
      run_model_op = tf.while_loop(lambda i: i < num_train_steps, body, [0],
                                   parallel_iterations=1)
      # The per-step loss tensor is not accessible outside the while loop.
      loss = None
    else:
      run_model_op, loss = build()

    return self._TrainModelProperties(
        record_files_placeholder, iterator, loss, params["batch_size"],
        run_model_op)

  def _build_eval_specific_graph(self, iterator, model_fn, params,
                                 record_files_placeholder, num_eval_steps):
    """Builds the part of the model that is specific to evaluation."""

    def build():
      features = iterator.get_next()
      estimator_spec = model_fn(
          features, None, tf.estimator.ModeKeys.EVAL, params)
      run_model_op = tf.group(*(update_op for _, update_op in
                                estimator_spec.eval_metric_ops.values()))
      eval_metric_tensors = {k: tensor for (k, (tensor, _))
                             in estimator_spec.eval_metric_ops.items()}
      return run_model_op, estimator_spec.loss, eval_metric_tensors

    if self._use_while_loop:
      def body(i):
        run_model_op_single_step, _, _ = build()
        with tf.control_dependencies([run_model_op_single_step]):
          return i + 1

      run_model_op = tf.while_loop(lambda i: i < num_eval_steps, body, [0],
                                   parallel_iterations=1)
      loss = None
      # Metric tensors created inside the while loop are inaccessible outside
      # it, so the means are recomputed from the underlying metric variables.
      eval_metric_tensors = {
          "HR": self._compute_metric_mean(rconst.HR_METRIC_NAME),
          "NDCG": self._compute_metric_mean(rconst.NDCG_METRIC_NAME),
      }
    else:
      run_model_op, loss, eval_metric_tensors = build()

    metric_initializer = tf.variables_initializer(
        tf.get_collection(tf.GraphKeys.METRIC_VARIABLES))
    return self._EvalModelProperties(
        record_files_placeholder, iterator, loss, params["eval_batch_size"],
        run_model_op, eval_metric_tensors, metric_initializer)

  def _train_or_eval(self, model_properties, num_steps, is_training):
    """Either trains or evaluates, depending on whether `is_training` is True.

    Args:
      model_properties: _TrainModelProperties or an _EvalModelProperties
        containing the properties of the training or evaluation graph.
      num_steps: The number of steps to train or evaluate for.
      is_training: If True, run the training model. If False, run the evaluation
        model.

    Returns:
      record_dir: The directory of TFRecords where the training/evaluation input
      data was read from.
    """
    if self._ncf_dataset is not None:
      epoch_metadata, record_dir, template = data_preprocessing.get_epoch_info(
          is_training=is_training, ncf_dataset=self._ncf_dataset)
      batch_count = epoch_metadata["batch_count"]
      if batch_count != num_steps:
        raise ValueError(
            "Step counts do not match. ({} vs. {}) The async process is "
            "producing incorrect shards.".format(batch_count, num_steps))
      record_files = os.path.join(record_dir, template.format("*"))
      initializer_feed_dict = {
          model_properties.record_files_placeholder: record_files}
      del batch_count
    else:
      # Synthetic data: the input_fn needs no record files.
      initializer_feed_dict = None
      record_dir = None

    self._session.run(model_properties.iterator.initializer,
                      initializer_feed_dict)
    fetches = (model_properties.run_model_op,)
    if model_properties.loss is not None:
      fetches += (model_properties.loss,)
    mode = "Train" if is_training else "Eval"
    start = None
    times_to_run = 1 if self._use_while_loop else num_steps
    for i in range(times_to_run):
      fetches_ = self._session.run(fetches)
      if i % 100 == 0:
        if start is None:
          # NOTE(review): the comment in the original said "after 100 steps",
          # but the timer actually starts once the first step has completed,
          # which still excludes graph warmup from the measurement.
          start = time.time()
          start_step = i
        if model_properties.loss is not None:
          _, loss = fetches_
          tf.logging.info("{} Loss = {}".format(mode, loss))
    end = time.time()
    if start is not None:
      print("{} performance: {} examples/sec".format(
          mode, (i - start_step) * model_properties.batch_size / (end - start)))
    return record_dir

  def train(self):
    """Trains the graph for a single cycle."""
    record_dir = self._train_or_eval(self._train_model_properties,
                                     self._num_train_steps, is_training=True)
    if record_dir:
      # We delete the record_dir because each cycle, new TFRecords is generated
      # by the async process.
      tf.gfile.DeleteRecursively(record_dir)

  def eval(self):
    """Evaluates the graph on the eval data.

    Returns:
      A dict of evaluation results.
    """
    self._session.run(self._eval_model_properties.metric_initializer)
    self._train_or_eval(self._eval_model_properties, self._num_eval_steps,
                        is_training=False)
    eval_results = {
        'global_step': self._session.run(self._global_step)}
    for key, val in self._eval_model_properties.metrics.items():
      # Run each metric tensor once and reuse the value for both logging and
      # the results dict (the original ran the graph twice per metric).
      val_ = self._session.run(val)
      tf.logging.info("{} = {}".format(key, val_))
      eval_results[key] = val_
    return eval_results
......@@ -24,6 +24,8 @@ from __future__ import print_function
import contextlib
import heapq
import json
import logging
import math
import multiprocessing
import os
......@@ -40,8 +42,8 @@ import tensorflow as tf
from tensorflow.contrib.compiler import xla
from official.datasets import movielens
from official.recommendation import constants as rconst
from official.recommendation import data_pipeline
from official.recommendation import data_preprocessing
from official.recommendation import model_runner
from official.recommendation import neumf_model
from official.utils.flags import core as flags_core
from official.utils.logs import hooks_helper
......@@ -54,74 +56,125 @@ from official.utils.misc import model_helpers
FLAGS = flags.FLAGS
def construct_estimator(num_gpus, model_dir, iterations, params, batch_size,
eval_batch_size):
def construct_estimator(model_dir, params):
"""Construct either an Estimator or TPUEstimator for NCF.
Args:
num_gpus: The number of gpus (Used to select distribution strategy)
model_dir: The model directory for the estimator
iterations: Estimator iterations
params: The params dict for the estimator
batch_size: The mini-batch size for training.
eval_batch_size: The batch size used during evaluation.
Returns:
An Estimator or TPUEstimator.
"""
if params["use_tpu"]:
# Some of the networking libraries are quite chatty.
for name in ["googleapiclient.discovery", "googleapiclient.discovery_cache",
"oauth2client.transport"]:
logging.getLogger(name).setLevel(logging.ERROR)
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
tpu=params["tpu"],
zone=params["tpu_zone"],
project=params["tpu_gcp_project"],
coordinator_name="coordinator"
)
tf.logging.info("Issuing reset command to TPU to ensure a clean state.")
tf.Session.reset(tpu_cluster_resolver.get_master())
tpu_config = tf.contrib.tpu.TPUConfig(
iterations_per_loop=iterations,
num_shards=8)
run_config = tf.contrib.tpu.RunConfig(
cluster=tpu_cluster_resolver,
model_dir=model_dir,
save_checkpoints_secs=600,
session_config=tf.ConfigProto(
allow_soft_placement=True, log_device_placement=False),
tpu_config=tpu_config)
tpu_params = {k: v for k, v in params.items() if k != "batch_size"}
train_estimator = tf.contrib.tpu.TPUEstimator(
model_fn=neumf_model.neumf_model_fn,
use_tpu=True,
train_batch_size=batch_size,
eval_batch_size=eval_batch_size,
params=tpu_params,
config=run_config)
eval_estimator = tf.contrib.tpu.TPUEstimator(
model_fn=neumf_model.neumf_model_fn,
use_tpu=True,
train_batch_size=1,
eval_batch_size=eval_batch_size,
params=tpu_params,
config=run_config)
return train_estimator, eval_estimator
distribution = distribution_utils.get_distribution_strategy(num_gpus=num_gpus)
# Estimator looks at the master it connects to for MonitoredTrainingSession
# by reading the `TF_CONFIG` environment variable, and the coordinator
# is used by StreamingFilesDataset.
tf_config_env = {
"session_master": tpu_cluster_resolver.get_master(),
"eval_session_master": tpu_cluster_resolver.get_master(),
"coordinator": tpu_cluster_resolver.cluster_spec()
.as_dict()["coordinator"]
}
os.environ['TF_CONFIG'] = json.dumps(tf_config_env)
distribution = tf.contrib.distribute.TPUStrategy(
tpu_cluster_resolver, steps_per_run=100)
else:
distribution = distribution_utils.get_distribution_strategy(
num_gpus=params["num_gpus"])
run_config = tf.estimator.RunConfig(train_distribute=distribution,
eval_distribute=distribution)
params["eval_batch_size"] = eval_batch_size
model_fn = neumf_model.neumf_model_fn
if params["use_xla_for_gpu"]:
tf.logging.info("Using XLA for GPU for training and evaluation.")
model_fn = xla.estimator_model_fn(model_fn)
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=model_dir,
config=run_config, params=params)
return estimator, estimator
return estimator
def log_and_get_hooks(eval_batch_size):
  """Create the training hooks, log run metadata, and return the logger.

  Args:
    eval_batch_size: Eval batch size, recorded in the benchmark run params.

  Returns:
    A (benchmark_logger, train_hooks) tuple.
  """
  # Hooks that report on training progress and selected tensor values.
  hooks = hooks_helper.get_train_hooks(
      FLAGS.hooks,
      model_dir=FLAGS.model_dir,
      batch_size=FLAGS.batch_size,  # for ExamplesPerSecondHook
      tensors_to_log={"cross_entropy": "cross_entropy"})

  bench_logger = logger.get_benchmark_logger()
  bench_logger.log_run_info(
      model_name="recommendation",
      dataset_name=FLAGS.dataset,
      run_params={
          "batch_size": FLAGS.batch_size,
          "eval_batch_size": eval_batch_size,
          "number_factors": FLAGS.num_factors,
          "hr_threshold": FLAGS.hr_threshold,
          "train_epochs": FLAGS.train_epochs,
      },
      test_id=FLAGS.benchmark_test_id)

  return bench_logger, hooks
def parse_flags(flags_obj):
  """Convenience function to turn flags into params.

  Args:
    flags_obj: A flags object (e.g. absl's FLAGS) carrying the NCF flags.

  Returns:
    A dict of params consumed by the model and input pipeline.
  """
  # Consistently read from the flags_obj argument rather than mixing in the
  # global FLAGS (the original read FLAGS.num_tpu_shards / FLAGS.tpu /
  # FLAGS.epochs_between_evals directly).
  num_gpus = flags_core.get_num_gpus(flags_obj)
  num_devices = flags_obj.num_tpu_shards if flags_obj.tpu else num_gpus or 1
  # Round the global batch size up so it divides evenly across devices.
  batch_size = (flags_obj.batch_size + num_devices - 1) // num_devices

  # Each eval user contributes NUM_EVAL_NEGATIVES + 1 examples, so the
  # per-device eval batch must be a multiple of that block size.
  eval_divisor = (rconst.NUM_EVAL_NEGATIVES + 1) * num_devices
  eval_batch_size = flags_obj.eval_batch_size or flags_obj.batch_size
  eval_batch_size = ((eval_batch_size + eval_divisor - 1) //
                     eval_divisor * eval_divisor // num_devices)

  return {
      "train_epochs": flags_obj.train_epochs,
      "batches_per_step": num_devices,
      "use_seed": flags_obj.seed is not None,
      "batch_size": batch_size,
      "eval_batch_size": eval_batch_size,
      "learning_rate": flags_obj.learning_rate,
      "mf_dim": flags_obj.num_factors,
      "model_layers": [int(layer) for layer in flags_obj.layers],
      "mf_regularization": flags_obj.mf_regularization,
      "mlp_reg_layers": [float(reg) for reg in flags_obj.mlp_regularization],
      "num_neg": flags_obj.num_neg,
      "num_gpus": num_gpus,
      "use_tpu": flags_obj.tpu is not None,
      "tpu": flags_obj.tpu,
      "tpu_zone": flags_obj.tpu_zone,
      "tpu_gcp_project": flags_obj.tpu_gcp_project,
      "beta1": flags_obj.beta1,
      "beta2": flags_obj.beta2,
      "epsilon": flags_obj.epsilon,
      "match_mlperf": flags_obj.ml_perf,
      "use_xla_for_gpu": flags_obj.use_xla_for_gpu,
      "epochs_between_evals": flags_obj.epochs_between_evals,
  }
def main(_):
......@@ -129,7 +182,6 @@ def main(_):
mlperf_helper.LOGGER(FLAGS.output_ml_perf_compliance_logging):
mlperf_helper.set_ncf_root(os.path.split(os.path.abspath(__file__))[0])
run_ncf(FLAGS)
mlperf_helper.stitch_ncf()
def run_ncf(_):
......@@ -140,105 +192,36 @@ def run_ncf(_):
if FLAGS.seed is not None:
np.random.seed(FLAGS.seed)
num_gpus = flags_core.get_num_gpus(FLAGS)
batch_size = distribution_utils.per_device_batch_size(
int(FLAGS.batch_size), num_gpus)
params = parse_flags(FLAGS)
total_training_cycle = FLAGS.train_epochs // FLAGS.epochs_between_evals
eval_per_user = rconst.NUM_EVAL_NEGATIVES + 1
eval_batch_size = int(FLAGS.eval_batch_size or
max([FLAGS.batch_size, eval_per_user]))
if eval_batch_size % eval_per_user:
eval_batch_size = eval_batch_size // eval_per_user * eval_per_user
tf.logging.warning(
"eval examples per user does not evenly divide eval_batch_size. "
"Overriding to {}".format(eval_batch_size))
if FLAGS.use_synthetic_data:
ncf_dataset = None
cleanup_fn = lambda: None
producer = data_pipeline.DummyConstructor()
num_users, num_items = data_preprocessing.DATASET_TO_NUM_USERS_AND_ITEMS[
FLAGS.dataset]
num_train_steps = data_preprocessing.SYNTHETIC_BATCHES_PER_EPOCH
num_eval_steps = data_preprocessing.SYNTHETIC_BATCHES_PER_EPOCH
num_train_steps = rconst.SYNTHETIC_BATCHES_PER_EPOCH
num_eval_steps = rconst.SYNTHETIC_BATCHES_PER_EPOCH
else:
ncf_dataset, cleanup_fn = data_preprocessing.instantiate_pipeline(
dataset=FLAGS.dataset, data_dir=FLAGS.data_dir,
batch_size=batch_size,
eval_batch_size=eval_batch_size,
num_neg=FLAGS.num_neg,
epochs_per_cycle=FLAGS.epochs_between_evals,
num_cycles=total_training_cycle,
match_mlperf=FLAGS.ml_perf,
deterministic=FLAGS.seed is not None,
use_subprocess=FLAGS.use_subprocess,
cache_id=FLAGS.cache_id)
num_users = ncf_dataset.num_users
num_items = ncf_dataset.num_items
num_train_steps = int(np.ceil(
FLAGS.epochs_between_evals * ncf_dataset.num_train_positives *
(1 + FLAGS.num_neg) / FLAGS.batch_size))
num_eval_steps = int(np.ceil((1 + rconst.NUM_EVAL_NEGATIVES) *
ncf_dataset.num_users / eval_batch_size))
num_users, num_items, producer = data_preprocessing.instantiate_pipeline(
dataset=FLAGS.dataset, data_dir=FLAGS.data_dir, params=params,
constructor_type=FLAGS.constructor_type,
deterministic=FLAGS.seed is not None)
num_train_steps = (producer.train_batches_per_epoch //
params["batches_per_step"])
num_eval_steps = (producer.eval_batches_per_epoch //
params["batches_per_step"])
assert not producer.train_batches_per_epoch % params["batches_per_step"]
assert not producer.eval_batches_per_epoch % params["batches_per_step"]
producer.start()
params["num_users"], params["num_items"] = num_users, num_items
model_helpers.apply_clean(flags.FLAGS)
params = {
"use_seed": FLAGS.seed is not None,
"hash_pipeline": FLAGS.hash_pipeline,
"batch_size": batch_size,
"eval_batch_size": eval_batch_size,
"learning_rate": FLAGS.learning_rate,
"num_users": num_users,
"num_items": num_items,
"mf_dim": FLAGS.num_factors,
"model_layers": [int(layer) for layer in FLAGS.layers],
"mf_regularization": FLAGS.mf_regularization,
"mlp_reg_layers": [float(reg) for reg in FLAGS.mlp_regularization],
"num_neg": FLAGS.num_neg,
"use_tpu": FLAGS.tpu is not None,
"tpu": FLAGS.tpu,
"tpu_zone": FLAGS.tpu_zone,
"tpu_gcp_project": FLAGS.tpu_gcp_project,
"beta1": FLAGS.beta1,
"beta2": FLAGS.beta2,
"epsilon": FLAGS.epsilon,
"match_mlperf": FLAGS.ml_perf,
"use_xla_for_gpu": FLAGS.use_xla_for_gpu,
"use_estimator": FLAGS.use_estimator,
}
if FLAGS.use_estimator:
train_estimator, eval_estimator = construct_estimator(
num_gpus=num_gpus, model_dir=FLAGS.model_dir,
iterations=num_train_steps, params=params,
batch_size=flags.FLAGS.batch_size, eval_batch_size=eval_batch_size)
else:
runner = model_runner.NcfModelRunner(ncf_dataset, params, num_train_steps,
num_eval_steps, FLAGS.use_while_loop)
# Create hooks that log information about the training and metric values
train_hooks = hooks_helper.get_train_hooks(
FLAGS.hooks,
model_dir=FLAGS.model_dir,
batch_size=FLAGS.batch_size, # for ExamplesPerSecondHook
tensors_to_log={"cross_entropy": "cross_entropy"}
)
run_params = {
"batch_size": FLAGS.batch_size,
"eval_batch_size": eval_batch_size,
"number_factors": FLAGS.num_factors,
"hr_threshold": FLAGS.hr_threshold,
"train_epochs": FLAGS.train_epochs,
}
benchmark_logger = logger.get_benchmark_logger()
benchmark_logger.log_run_info(
model_name="recommendation",
dataset_name=FLAGS.dataset,
run_params=run_params,
test_id=FLAGS.benchmark_test_id)
estimator = construct_estimator(model_dir=FLAGS.model_dir, params=params)
benchmark_logger, train_hooks = log_and_get_hooks(params["eval_batch_size"])
eval_input_fn = None
target_reached = False
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.TRAIN_LOOP)
for cycle_index in range(total_training_cycle):
......@@ -249,47 +232,21 @@ def run_ncf(_):
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.TRAIN_EPOCH,
value=cycle_index)
# Train the model
if FLAGS.use_estimator:
train_input_fn, train_record_dir, batch_count = \
data_preprocessing.make_input_fn(
ncf_dataset=ncf_dataset, is_training=True)
if batch_count != num_train_steps:
raise ValueError(
"Step counts do not match. ({} vs. {}) The async process is "
"producing incorrect shards.".format(batch_count, num_train_steps))
train_estimator.train(input_fn=train_input_fn, hooks=train_hooks,
steps=num_train_steps)
if train_record_dir:
tf.gfile.DeleteRecursively(train_record_dir)
tf.logging.info("Beginning evaluation.")
if eval_input_fn is None:
eval_input_fn, _, eval_batch_count = data_preprocessing.make_input_fn(
ncf_dataset=ncf_dataset, is_training=False)
if eval_batch_count != num_eval_steps:
raise ValueError(
"Step counts do not match. ({} vs. {}) The async process is "
"producing incorrect shards.".format(
eval_batch_count, num_eval_steps))
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.EVAL_START,
value=cycle_index)
eval_results = eval_estimator.evaluate(eval_input_fn,
steps=num_eval_steps)
tf.logging.info("Evaluation complete.")
else:
runner.train()
tf.logging.info("Beginning evaluation.")
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.EVAL_START,
value=cycle_index)
eval_results = runner.eval()
tf.logging.info("Evaluation complete.")
train_input_fn = producer.make_input_fn(is_training=True)
estimator.train(input_fn=train_input_fn, hooks=train_hooks,
steps=num_train_steps)
tf.logging.info("Beginning evaluation.")
eval_input_fn = producer.make_input_fn(is_training=False)
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.EVAL_START,
value=cycle_index)
eval_results = estimator.evaluate(eval_input_fn, steps=num_eval_steps)
tf.logging.info("Evaluation complete.")
hr = float(eval_results[rconst.HR_KEY])
ndcg = float(eval_results[rconst.NDCG_KEY])
loss = float(eval_results["loss"])
mlperf_helper.ncf_print(
key=mlperf_helper.TAGS.EVAL_TARGET,
......@@ -300,18 +257,14 @@ def run_ncf(_):
key=mlperf_helper.TAGS.EVAL_HP_NUM_NEG,
value={"epoch": cycle_index, "value": rconst.NUM_EVAL_NEGATIVES})
# Logged by the async process during record creation.
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.EVAL_HP_NUM_USERS,
deferred=True)
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.EVAL_STOP, value=cycle_index)
# Benchmark the evaluation results
benchmark_logger.log_evaluation_result(eval_results)
# Log the HR and NDCG results.
tf.logging.info(
"Iteration {}: HR = {:.4f}, NDCG = {:.4f}".format(
cycle_index + 1, hr, ndcg))
"Iteration {}: HR = {:.4f}, NDCG = {:.4f}, Loss = {:.4f}".format(
cycle_index + 1, hr, ndcg, loss))
# If some evaluation threshold is met
if model_helpers.past_stop_threshold(FLAGS.hr_threshold, hr):
......@@ -320,7 +273,8 @@ def run_ncf(_):
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.RUN_STOP,
value={"success": target_reached})
cleanup_fn() # Cleanup data construction artifacts and subprocess.
producer.stop_loop()
producer.join()
# Clear the session explicitly to avoid session delete error
tf.keras.backend.clear_session()
......@@ -366,7 +320,7 @@ def define_ncf_flags():
name="download_if_missing", default=True, help=flags_core.help_wrap(
"Download data to data_dir if it is not already present."))
flags.DEFINE_string(
flags.DEFINE_integer(
name="eval_batch_size", default=None, help=flags_core.help_wrap(
"The batch size used for evaluation. This should generally be larger"
"than the training batch size as the lack of back propagation during"
......@@ -428,6 +382,14 @@ def define_ncf_flags():
"For dataset ml-20m, the threshold can be set as 0.95 which is "
"achieved by MLPerf implementation."))
flags.DEFINE_enum(
name="constructor_type", default="bisection",
enum_values=["bisection", "materialized"], case_sensitive=False,
help=flags_core.help_wrap(
"Strategy to use for generating false negatives. materialized has a"
"precompute that scales badly, but a faster per-epoch construction"
"time and can be faster on very large systems."))
flags.DEFINE_bool(
name="ml_perf", default=False,
help=flags_core.help_wrap(
......@@ -459,31 +421,12 @@ def define_ncf_flags():
name="seed", default=None, help=flags_core.help_wrap(
"This value will be used to seed both NumPy and TensorFlow."))
flags.DEFINE_bool(
name="hash_pipeline", default=False, help=flags_core.help_wrap(
"This flag will perform a separate run of the pipeline and hash "
"batches as they are produced. \nNOTE: this will significantly slow "
"training. However it is useful to confirm that a random seed is "
"does indeed make the data pipeline deterministic."))
@flags.validator("eval_batch_size", "eval_batch_size must be at least {}"
                 .format(rconst.NUM_EVAL_NEGATIVES + 1))
def eval_size_check(eval_batch_size):
  """Validate that an eval batch can hold one positive plus all negatives."""
  # An unset flag is acceptable; a sensible default is chosen downstream.
  if eval_batch_size is None:
    return True
  return int(eval_batch_size) > rconst.NUM_EVAL_NEGATIVES
flags.DEFINE_bool(
name="use_subprocess", default=True, help=flags_core.help_wrap(
"By default, ncf_main.py starts async data generation process as a "
"subprocess. If set to False, ncf_main.py will assume the async data "
"generation process has already been started by the user."))
flags.DEFINE_integer(name="cache_id", default=None, help=flags_core.help_wrap(
"Use a specified cache_id rather than using a timestamp. This is only "
"needed to synchronize across multiple workers. Generally this flag will "
"not need to be set."
))
flags.DEFINE_bool(
name="use_xla_for_gpu", default=False, help=flags_core.help_wrap(
"If True, use XLA for the model function. Only works when using a "
......@@ -494,30 +437,6 @@ def define_ncf_flags():
def xla_validator(flag_dict):
return not flag_dict["use_xla_for_gpu"] or not flag_dict["tpu"]
flags.DEFINE_bool(
name="use_estimator", default=True, help=flags_core.help_wrap(
"If True, use Estimator to train. Setting to False is slightly "
"faster, but when False, the following are currently unsupported:\n"
" * Using TPUs\n"
" * Using more than 1 GPU\n"
" * Reloading from checkpoints\n"
" * Any hooks specified with --hooks\n"))
flags.DEFINE_bool(
name="use_while_loop", default=None, help=flags_core.help_wrap(
"If set, run an entire epoch in a session.run() call using a "
"TensorFlow while loop. This can improve performance, but will not "
"print out losses throughout the epoch. Requires "
"--use_estimator=false"
))
# NOTE(review): renamed from "xla_message" -- this message belongs to the
# while-loop validator and has nothing to do with XLA; the old name was a
# copy/paste artifact.
while_loop_message = "--use_while_loop requires --use_estimator=false"
@flags.multi_flags_validator(["use_while_loop", "use_estimator"],
                             message=while_loop_message)
def while_loop_validator(flag_dict):
  """Ensure --use_while_loop is only set when --use_estimator=false."""
  return (not flag_dict["use_while_loop"] or
          not flag_dict["use_estimator"])
if __name__ == "__main__":
tf.logging.set_verbosity(tf.logging.INFO)
......
......@@ -24,13 +24,11 @@ import mock
import numpy as np
import tensorflow as tf
from absl import flags
from absl.testing import flagsaver
from official.recommendation import constants as rconst
from official.recommendation import data_preprocessing
from official.recommendation import data_pipeline
from official.recommendation import neumf_model
from official.recommendation import ncf_main
from official.recommendation import stat_utils
NUM_TRAIN_NEG = 4
......@@ -56,6 +54,13 @@ class NcfTest(tf.test.TestCase):
top_k=rconst.TOP_K, match_mlperf=False):
rconst.TOP_K = top_k
rconst.NUM_EVAL_NEGATIVES = predicted_scores_by_user.shape[1] - 1
batch_size = items_by_user.shape[0]
users = np.repeat(np.arange(batch_size)[:, np.newaxis],
rconst.NUM_EVAL_NEGATIVES + 1, axis=1)
users, items, duplicate_mask = \
data_pipeline.BaseDataConstructor._assemble_eval_batch(
users, items_by_user[:, -1:], items_by_user[:, :-1], batch_size)
g = tf.Graph()
with g.as_default():
......@@ -63,8 +68,7 @@ class NcfTest(tf.test.TestCase):
predicted_scores_by_user.reshape((-1, 1)), tf.float32)
softmax_logits = tf.concat([tf.zeros(logits.shape, dtype=logits.dtype),
logits], axis=1)
duplicate_mask = tf.convert_to_tensor(
stat_utils.mask_duplicates(items_by_user, axis=1), tf.float32)
duplicate_mask = tf.convert_to_tensor(duplicate_mask, tf.float32)
metric_ops = neumf_model.compute_eval_loss_and_metrics(
logits=logits, softmax_logits=softmax_logits,
......@@ -81,21 +85,19 @@ class NcfTest(tf.test.TestCase):
sess.run(init)
return sess.run([hr[1], ndcg[1]])
def test_hit_rate_and_ndcg(self):
# Test with no duplicate items
predictions = np.array([
[1., 2., 0.], # In top 2
[2., 1., 0.], # In top 1
[0., 2., 1.], # In top 3
[2., 3., 4.] # In top 3
[2., 0., 1.], # In top 2
[1., 0., 2.], # In top 1
[2., 1., 0.], # In top 3
[3., 4., 2.] # In top 3
])
items = np.array([
[1, 2, 3],
[2, 3, 1],
[3, 2, 1],
[3, 1, 2],
[2, 1, 3],
[1, 3, 2],
])
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 1)
......@@ -130,16 +132,16 @@ class NcfTest(tf.test.TestCase):
# Test with duplicate items. In the MLPerf case, we treat the duplicates as
# a single item. Otherwise, we treat the duplicates as separate items.
predictions = np.array([
[1., 2., 2., 3.], # In top 4. MLPerf: In top 3
[3., 1., 0., 2.], # In top 1. MLPerf: In top 1
[0., 2., 3., 2.], # In top 4. MLPerf: In top 3
[3., 2., 4., 2.] # In top 2. MLPerf: In top 2
[2., 2., 3., 1.], # In top 4. MLPerf: In top 3
[1., 0., 2., 3.], # In top 1. MLPerf: In top 1
[2., 3., 2., 0.], # In top 4. MLPerf: In top 3
[2., 4., 2., 3.] # In top 2. MLPerf: In top 2
])
items = np.array([
[1, 2, 2, 3],
[1, 2, 3, 4],
[1, 2, 3, 2],
[4, 3, 2, 1],
[2, 2, 3, 1],
[2, 3, 4, 1],
[2, 3, 2, 1],
[3, 2, 1, 4],
])
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 1)
self.assertAlmostEqual(hr, 1 / 4)
......@@ -180,59 +182,6 @@ class NcfTest(tf.test.TestCase):
self.assertAlmostEqual(ndcg, (1 + math.log(2) / math.log(3) +
2 * math.log(2) / math.log(4)) / 4)
# Test with duplicate items, where the predictions for the same item can
# differ. In the MLPerf case, we should take the first prediction.
predictions = np.array([
[3., 2., 4., 4.], # In top 3. MLPerf: In top 2
[3., 4., 2., 4.], # In top 3. MLPerf: In top 3
[2., 3., 4., 1.], # In top 3. MLPerf: In top 2
[4., 3., 5., 2.] # In top 2. MLPerf: In top 1
])
items = np.array([
[1, 2, 2, 3],
[4, 3, 3, 2],
[2, 1, 1, 1],
[4, 2, 2, 1],
])
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 1)
self.assertAlmostEqual(hr, 0 / 4)
self.assertAlmostEqual(ndcg, 0 / 4)
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 2)
self.assertAlmostEqual(hr, 1 / 4)
self.assertAlmostEqual(ndcg, (math.log(2) / math.log(3)) / 4)
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 3)
self.assertAlmostEqual(hr, 4 / 4)
self.assertAlmostEqual(ndcg, (math.log(2) / math.log(3) +
3 * math.log(2) / math.log(4)) / 4)
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 4)
self.assertAlmostEqual(hr, 4 / 4)
self.assertAlmostEqual(ndcg, (math.log(2) / math.log(3) +
3 * math.log(2) / math.log(4)) / 4)
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 1,
match_mlperf=True)
self.assertAlmostEqual(hr, 1 / 4)
self.assertAlmostEqual(ndcg, 1 / 4)
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 2,
match_mlperf=True)
self.assertAlmostEqual(hr, 3 / 4)
self.assertAlmostEqual(ndcg, (1 + 2 * math.log(2) / math.log(3)) / 4)
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 3,
match_mlperf=True)
self.assertAlmostEqual(hr, 4 / 4)
self.assertAlmostEqual(ndcg, (1 + 2 * math.log(2) / math.log(3) +
math.log(2) / math.log(4)) / 4)
hr, ndcg = self.get_hit_rate_and_ndcg(predictions, items, 4,
match_mlperf=True)
self.assertAlmostEqual(hr, 4 / 4)
self.assertAlmostEqual(ndcg, (1 + 2 * math.log(2) / math.log(3) +
math.log(2) / math.log(4)) / 4)
_BASE_END_TO_END_FLAGS = {
"batch_size": 1024,
......@@ -241,33 +190,15 @@ class NcfTest(tf.test.TestCase):
}
@flagsaver.flagsaver(**_BASE_END_TO_END_FLAGS)
@mock.patch.object(data_preprocessing, "SYNTHETIC_BATCHES_PER_EPOCH", 100)
@mock.patch.object(rconst, "SYNTHETIC_BATCHES_PER_EPOCH", 100)
def test_end_to_end(self):
ncf_main.main(None)
@flagsaver.flagsaver(ml_perf=True, **_BASE_END_TO_END_FLAGS)
@mock.patch.object(data_preprocessing, "SYNTHETIC_BATCHES_PER_EPOCH", 100)
@mock.patch.object(rconst, "SYNTHETIC_BATCHES_PER_EPOCH", 100)
def test_end_to_end_mlperf(self):
ncf_main.main(None)
@flagsaver.flagsaver(use_estimator=False, **_BASE_END_TO_END_FLAGS)
@mock.patch.object(data_preprocessing, "SYNTHETIC_BATCHES_PER_EPOCH", 100)
def test_end_to_end_no_estimator(self):
ncf_main.main(None)
flags.FLAGS.ml_perf = True
ncf_main.main(None)
@flagsaver.flagsaver(use_estimator=False, **_BASE_END_TO_END_FLAGS)
@mock.patch.object(data_preprocessing, "SYNTHETIC_BATCHES_PER_EPOCH", 100)
def test_end_to_end_while_loop(self):
# We cannot set use_while_loop = True in the flagsaver constructor, because
# if the flagsaver sets it to True before setting use_estimator to False,
# the flag validator will throw an error.
flags.FLAGS.use_while_loop = True
ncf_main.main(None)
flags.FLAGS.ml_perf = True
ncf_main.main(None)
if __name__ == "__main__":
tf.logging.set_verbosity(tf.logging.INFO)
......
......@@ -76,44 +76,24 @@ def neumf_model_fn(features, labels, mode, params):
tf.set_random_seed(stat_utils.random_int32())
users = features[movielens.USER_COLUMN]
items = tf.cast(features[movielens.ITEM_COLUMN], tf.int32)
items = features[movielens.ITEM_COLUMN]
keras_model = params.get("keras_model")
if keras_model:
logits = keras_model([users, items],
training=mode == tf.estimator.ModeKeys.TRAIN)
else:
keras_model = construct_model(users=users, items=items, params=params)
logits = keras_model.output
if not params["use_estimator"] and "keras_model" not in params:
# When we are not using estimator, we need to reuse the Keras model when
# this model_fn is called again, so that the variables are shared between
# training and eval. So we mutate params to add the Keras model.
params["keras_model"] = keras_model
logits = construct_model(users, items, params).output
# Softmax with the first column of zeros is equivalent to sigmoid.
softmax_logits = tf.concat([tf.zeros(logits.shape, dtype=logits.dtype),
logits], axis=1)
if mode == tf.estimator.ModeKeys.PREDICT:
predictions = {
movielens.ITEM_COLUMN: items,
movielens.RATING_COLUMN: logits,
}
if params["use_tpu"]:
return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, predictions=predictions)
return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
elif mode == tf.estimator.ModeKeys.EVAL:
if mode == tf.estimator.ModeKeys.EVAL:
duplicate_mask = tf.cast(features[rconst.DUPLICATE_MASK], tf.float32)
return compute_eval_loss_and_metrics(
logits, softmax_logits, duplicate_mask, params["num_neg"],
params["match_mlperf"],
use_tpu_spec=params["use_tpu"] or params["use_xla_for_gpu"])
use_tpu_spec=params["use_xla_for_gpu"])
elif mode == tf.estimator.ModeKeys.TRAIN:
labels = tf.cast(labels, tf.int32)
valid_pt_mask = features[rconst.VALID_POINT_MASK]
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.OPT_NAME, value="adam")
mlperf_helper.ncf_print(key=mlperf_helper.TAGS.OPT_LR,
......@@ -135,7 +115,8 @@ def neumf_model_fn(features, labels, mode, params):
value=mlperf_helper.TAGS.BCE)
loss = tf.losses.sparse_softmax_cross_entropy(
labels=labels,
logits=softmax_logits
logits=softmax_logits,
weights=tf.cast(valid_pt_mask, tf.float32)
)
# This tensor is used by logging hooks.
......@@ -151,9 +132,6 @@ def neumf_model_fn(features, labels, mode, params):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_op = tf.group(minimize_op, update_ops)
if params["use_tpu"]:
return tf.contrib.tpu.TPUEstimatorSpec(
mode=mode, loss=loss, train_op=train_op)
return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
else:
......@@ -161,21 +139,18 @@ def neumf_model_fn(features, labels, mode, params):
def construct_model(users, items, params):
# type: (tf.Tensor, tf.Tensor, dict) -> tf.Tensor
# type: (tf.Tensor, tf.Tensor, dict) -> tf.keras.Model
"""Initialize NeuMF model.
Args:
users: Tensor of user ids.
items: Tensor of item ids.
params: Dict of hyperparameters.
Raises:
ValueError: if the first model layer is not even.
Returns:
logits: network logits
model: a keras Model for computing the logits
"""
num_users = params["num_users"]
num_items = params["num_items"]
......@@ -194,82 +169,39 @@ def construct_model(users, items, params):
raise ValueError("The first layer size should be multiple of 2!")
# Input variables
user_input = tf.keras.layers.Input(tensor=users)
item_input = tf.keras.layers.Input(tensor=items)
batch_size = user_input.get_shape()[0]
if params["use_tpu"]:
with tf.variable_scope("embed_weights", reuse=tf.AUTO_REUSE):
cmb_embedding_user = tf.get_variable(
name="embeddings_mf_user",
shape=[num_users, mf_dim + model_layers[0] // 2],
initializer=tf.glorot_uniform_initializer())
cmb_embedding_item = tf.get_variable(
name="embeddings_mf_item",
shape=[num_items, mf_dim + model_layers[0] // 2],
initializer=tf.glorot_uniform_initializer())
cmb_user_latent = tf.keras.layers.Lambda(lambda ids: tf.gather(
cmb_embedding_user, ids))(user_input)
cmb_item_latent = tf.keras.layers.Lambda(lambda ids: tf.gather(
cmb_embedding_item, ids))(item_input)
mlp_user_latent = tf.keras.layers.Lambda(
lambda x: tf.slice(x, [0, 0], [batch_size, model_layers[0] // 2])
)(cmb_user_latent)
mlp_item_latent = tf.keras.layers.Lambda(
lambda x: tf.slice(x, [0, 0], [batch_size, model_layers[0] // 2])
)(cmb_item_latent)
mf_user_latent = tf.keras.layers.Lambda(
lambda x: tf.slice(x, [0, model_layers[0] // 2], [batch_size, mf_dim])
)(cmb_user_latent)
mf_item_latent = tf.keras.layers.Lambda(
lambda x: tf.slice(x, [0, model_layers[0] // 2], [batch_size, mf_dim])
)(cmb_item_latent)
else:
# Initializer for embedding layers
embedding_initializer = "glorot_uniform"
# Embedding layers of GMF and MLP
mf_embedding_user = tf.keras.layers.Embedding(
num_users,
mf_dim,
embeddings_initializer=embedding_initializer,
embeddings_regularizer=tf.keras.regularizers.l2(mf_regularization),
input_length=1)
mf_embedding_item = tf.keras.layers.Embedding(
num_items,
mf_dim,
embeddings_initializer=embedding_initializer,
embeddings_regularizer=tf.keras.regularizers.l2(mf_regularization),
input_length=1)
mlp_embedding_user = tf.keras.layers.Embedding(
num_users,
model_layers[0]//2,
embeddings_initializer=embedding_initializer,
embeddings_regularizer=tf.keras.regularizers.l2(mlp_reg_layers[0]),
input_length=1)
mlp_embedding_item = tf.keras.layers.Embedding(
num_items,
model_layers[0]//2,
embeddings_initializer=embedding_initializer,
embeddings_regularizer=tf.keras.regularizers.l2(mlp_reg_layers[0]),
input_length=1)
# GMF part
mf_user_latent = mf_embedding_user(user_input)
mf_item_latent = mf_embedding_item(item_input)
# MLP part
mlp_user_latent = mlp_embedding_user(user_input)
mlp_item_latent = mlp_embedding_item(item_input)
user_input = tf.keras.layers.Input(tensor=users, name="user_input")
item_input = tf.keras.layers.Input(tensor=items, name="item_input")
# Initializer for embedding layers
embedding_initializer = "glorot_uniform"
# It turns out to be significantly more effecient to store the MF and MLP
# embedding portions in the same table, and then slice as needed.
mf_slice_fn = lambda x: x[:, :mf_dim]
mlp_slice_fn = lambda x: x[:, mf_dim:]
embedding_user = tf.keras.layers.Embedding(
num_users, mf_dim + model_layers[0] // 2,
embeddings_initializer=embedding_initializer,
embeddings_regularizer=tf.keras.regularizers.l2(mf_regularization),
input_length=1, name="embedding_user")(user_input)
embedding_item = tf.keras.layers.Embedding(
num_items, mf_dim + model_layers[0] // 2,
embeddings_initializer=embedding_initializer,
embeddings_regularizer=tf.keras.regularizers.l2(mf_regularization),
input_length=1, name="embedding_item")(item_input)
# GMF part
mf_user_latent = tf.keras.layers.Lambda(
mf_slice_fn, name="embedding_user_mf")(embedding_user)
mf_item_latent = tf.keras.layers.Lambda(
mf_slice_fn, name="embedding_item_mf")(embedding_item)
# MLP part
mlp_user_latent = tf.keras.layers.Lambda(
mlp_slice_fn, name="embedding_user_mlp")(embedding_user)
mlp_item_latent = tf.keras.layers.Lambda(
mlp_slice_fn, name="embedding_item_mlp")(embedding_item)
# Element-wise multiply
mf_vector = tf.keras.layers.multiply([mf_user_latent, mf_item_latent])
......@@ -352,7 +284,7 @@ def compute_eval_loss_and_metrics(logits, # type: tf.Tensor
Args:
logits: A tensor containing the predicted logits for each user. The shape
of logits is (num_users_per_batch * (1 + NUM_EVAL_NEGATIVES),) Logits
for a user are grouped, and the first element of the group is the true
for a user are grouped, and the last element of the group is the true
element.
softmax_logits: The same tensor, but with zeros left-appended.
......@@ -377,9 +309,9 @@ def compute_eval_loss_and_metrics(logits, # type: tf.Tensor
# Examples are provided by the eval Dataset in a structured format, so eval
# labels can be reconstructed on the fly.
eval_labels = tf.reshape(tf.one_hot(
tf.zeros(shape=(logits_by_user.shape[0],), dtype=tf.int32),
logits_by_user.shape[1], dtype=tf.int32), (-1,))
eval_labels = tf.reshape(shape=(-1,), tensor=tf.one_hot(
tf.zeros(shape=(logits_by_user.shape[0],), dtype=tf.int32) +
rconst.NUM_EVAL_NEGATIVES, logits_by_user.shape[1], dtype=tf.int32))
eval_labels_float = tf.cast(eval_labels, tf.float32)
......@@ -463,7 +395,8 @@ def compute_top_k_and_ndcg(logits, # type: tf.Tensor
# perform matrix multiplications very quickly. This is similar to np.argwhere.
# However this is a special case because the target will only appear in
# sort_indices once.
one_hot_position = tf.cast(tf.equal(sort_indices, 0), tf.int32)
one_hot_position = tf.cast(tf.equal(sort_indices, rconst.NUM_EVAL_NEGATIVES),
tf.int32)
sparse_positions = tf.multiply(
one_hot_position, tf.range(logits_by_user.shape[1])[tf.newaxis, :])
position_vector = tf.reduce_sum(sparse_positions, axis=1)
......
......@@ -16,21 +16,45 @@
import contextlib
import multiprocessing
import os
import sys
import multiprocessing.pool
_PYTHON = sys.executable
if not _PYTHON:
raise RuntimeError("Could not find path to Python interpreter in order to "
"spawn subprocesses.")
def get_forkpool(num_workers, init_worker=None, closing=True):
  """Construct a process (fork) pool.

  Args:
    num_workers: Number of worker processes in the pool.
    init_worker: Optional callable run once in each worker at startup.
    closing: If True, wrap the pool so it is closed when used as a
      context manager.

  Returns:
    A multiprocessing.Pool, possibly wrapped by contextlib.closing.
  """
  fork_pool = multiprocessing.Pool(processes=num_workers,
                                   initializer=init_worker)
  if not closing:
    return fork_pool
  return contextlib.closing(fork_pool)
_ASYNC_GEN_PATH = os.path.join(os.path.dirname(__file__),
"data_async_generation.py")
INVOCATION = [_PYTHON, _ASYNC_GEN_PATH]
def get_threadpool(num_workers, init_worker=None, closing=True):
  """Construct a thread pool.

  Args:
    num_workers: Number of worker threads in the pool.
    init_worker: Optional callable run once in each worker thread at startup.
    closing: If True, wrap the pool so it is closed when used as a
      context manager.

  Returns:
    A multiprocessing.pool.ThreadPool, possibly wrapped by contextlib.closing.
  """
  thread_pool = multiprocessing.pool.ThreadPool(
      processes=num_workers, initializer=init_worker)
  if not closing:
    return thread_pool
  return contextlib.closing(thread_pool)
def get_pool(num_workers, init_worker=None):
  """Construct a process pool that closes when its context exits.

  Args:
    num_workers: Number of worker processes in the pool.
    init_worker: Optional callable run once in each worker at startup.

  Returns:
    A multiprocessing.Pool wrapped by contextlib.closing.
  """
  worker_pool = multiprocessing.Pool(processes=num_workers,
                                     initializer=init_worker)
  return contextlib.closing(worker_pool)
class FauxPool(object):
  """Mimic a pool using for loops.

  This class is used in place of proper pools when true determinism is desired
  for testing or debugging.
  """

  def __init__(self, *args, **kwargs):
    # Accept (and ignore) the standard pool constructor arguments so that
    # FauxPool is a drop-in replacement for the real pools.
    pass

  def map(self, func, iterable, chunksize=None):
    """Apply func to each element serially, returning a list of results."""
    results = []
    for element in iterable:
      results.append(func(element))
    return results

  def imap(self, func, iterable, chunksize=1):
    """Lazily apply func to each element of iterable, in order."""
    return (func(element) for element in iterable)

  def close(self):
    # There are no workers to shut down.
    pass

  def terminate(self):
    # There are no workers to terminate.
    pass

  def join(self):
    # There are no workers to wait on.
    pass
def get_fauxpool(num_workers, init_worker=None, closing=True):
  """Construct a deterministic serial "pool" for testing or debugging.

  Args:
    num_workers: Ignored; accepted for interface parity with real pools.
    init_worker: Ignored; accepted for interface parity with real pools.
    closing: If True, wrap the pool so it is closed when used as a
      context manager.

  Returns:
    A FauxPool, possibly wrapped by contextlib.closing.
  """
  faux_pool = FauxPool(processes=num_workers, initializer=init_worker)
  if not closing:
    return faux_pool
  return contextlib.closing(faux_pool)
......@@ -27,7 +27,7 @@ mkdir -p ${LOCAL_TEST_DIR}
TPU=${TPU:-""}
if [[ -z ${TPU} ]]; then
DEVICE_FLAG="--num_gpus -1 --use_xla_for_gpu"
DEVICE_FLAG="--num_gpus -1" # --use_xla_for_gpu"
else
DEVICE_FLAG="--tpu ${TPU} --num_gpus 0"
fi
......@@ -54,25 +54,25 @@ do
# To reduce variation set the seed flag:
# --seed ${i}
#
# And to confirm that the pipeline is deterministic pass the flag:
# --hash_pipeline
#
# (`--hash_pipeline` will slow down training, though not as much as one might imagine.)
python ncf_main.py --model_dir ${MODEL_DIR} \
--data_dir ${DATA_DIR} \
--dataset ${DATASET} --hooks "" \
${DEVICE_FLAG} \
--clean \
--train_epochs 20 \
--batch_size 2048 \
--eval_batch_size 100000 \
--learning_rate 0.0005 \
--layers 256,256,128,64 --num_factors 64 \
--hr_threshold 0.635 \
--ml_perf \
python -u ncf_main.py \
--model_dir ${MODEL_DIR} \
--data_dir ${DATA_DIR} \
--dataset ${DATASET} --hooks "" \
${DEVICE_FLAG} \
--clean \
--train_epochs 14 \
--batch_size 98304 \
--eval_batch_size 160000 \
--learning_rate 0.00382059 \
--beta1 0.783529 \
--beta2 0.909003 \
--epsilon 1.45439e-07 \
--layers 256,256,128,64 --num_factors 64 \
--hr_threshold 0.635 \
--ml_perf \
|& tee ${RUN_LOG} \
| grep --line-buffered -E --regexp="(Iteration [0-9]+: HR = [0-9\.]+, NDCG = [0-9\.]+)|(pipeline_hash)|(MLPerf time:)"
| grep --line-buffered -E --regexp="(Iteration [0-9]+: HR = [0-9\.]+, NDCG = [0-9\.]+, Loss = [0-9\.]+)|(pipeline_hash)|(MLPerf time:)"
END_TIME=$(date +%s)
echo "Run ${i} complete: $(( $END_TIME - $START_TIME )) seconds."
......
......@@ -18,71 +18,45 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import numpy as np
def random_int32():
  """Draw a uniform random non-negative value that fits in an int32."""
  upper_bound = np.iinfo(np.int32).max
  return np.random.randint(low=0, high=upper_bound, dtype=np.int32)
def sample_with_exclusion(num_items, positive_set, n, replacement=True):
# type: (int, typing.Iterable, int, bool) -> list
"""Vectorized negative sampling.
This function samples from the positive set's conjugate, both with and
without replacement.
def permutation(args):
  """Fork safe permutation function.

  This function can be called within a multiprocessing worker and give
  appropriately random results.

  Args:
    args: A size two tuple that will be unpacked into the size of the
      permutation and the random seed. This form is used because starmap is
      not universally available.

  Returns:
    A NumPy array containing a random permutation.
  """
  x, seed = args

  # If seed is None NumPy will seed randomly.
  state = np.random.RandomState(seed=seed)  # pylint: disable=no-member
  output = np.arange(x, dtype=np.int32)
  state.shuffle(output)
  return output
def very_slightly_biased_randint(max_val_vector):
  """Sample non-negative integers below per-element upper bounds.

  A uint64 sample is reduced modulo each entry of max_val_vector. Because
  2**64 is astronomically larger than any realistic bound, the modulo bias
  is negligible -- hence "very slightly biased".

  Args:
    max_val_vector: NumPy array of exclusive upper bounds, one per sample.

  Returns:
    An array with the same shape and dtype as max_val_vector, with each
    entry drawn from [0, max_val_vector).
  """
  sample_dtype = np.uint64
  out_dtype = max_val_vector.dtype
  samples = np.random.randint(low=0, high=np.iinfo(sample_dtype).max,
                              size=max_val_vector.shape, dtype=sample_dtype)
  bounds = max_val_vector.astype(sample_dtype)
  return (samples % bounds).astype(out_dtype)
if not isinstance(positive_set, set):
positive_set = set(positive_set)
p = 1 - len(positive_set) / num_items
n_attempt = int(n * (1 / p) * 1.2) # factor of 1.2 for safety
# If sampling is performed with replacement, candidates are appended.
# Otherwise, they should be added with a set union to remove duplicates.
if replacement:
negatives = []
else:
negatives = set()
while len(negatives) < n:
negative_candidates = np.random.randint(
low=0, high=num_items, size=(n_attempt,))
if replacement:
negatives.extend(
[i for i in negative_candidates if i not in positive_set]
)
else:
negatives |= (set(negative_candidates) - positive_set)
if not replacement:
negatives = list(negatives)
np.random.shuffle(negatives) # list(set(...)) is not order guaranteed, but
# in practice tends to be quite ordered.
return negatives[:n]
def mask_duplicates(x, axis=1): # type: (np.ndarray, int) -> np.ndarray
"""Identify duplicates from sampling with replacement.
......
......@@ -2,9 +2,10 @@ google-api-python-client>=1.6.7
google-cloud-bigquery>=0.31.0
kaggle>=1.3.9
mlperf_compliance==0.0.10
numpy
numpy>=1.15.4
oauth2client>=4.1.2
pandas
pandas>=0.22.0
psutil>=5.4.3
py-cpuinfo>=3.3.0
scipy>=0.19.1
typing
......@@ -228,5 +228,3 @@ class DummyContextManager(object):
def __exit__(self, *args):
pass
......@@ -34,7 +34,7 @@ import typing
import tensorflow as tf
_MIN_VERSION = (0, 0, 6)
_MIN_VERSION = (0, 0, 10)
_STACK_OFFSET = 2
SUDO = "sudo" if os.geteuid() else ""
......@@ -186,60 +186,6 @@ def clear_system_caches():
raise ValueError("Failed to clear caches")
def stitch_ncf():
  """Format NCF logs for MLPerf compliance."""
  # No-op unless MLPerf compliance logging was enabled for this run.
  if not LOGGER.enabled:
    return

  if LOGGER.log_file is None or not tf.gfile.Exists(LOGGER.log_file):
    tf.logging.warning("Could not find log file to stitch.")
    return

  log_lines = []
  num_eval_users = None
  start_time = None
  stop_time = None
  with tf.gfile.Open(LOGGER.log_file, "r") as f:
    for line in f:
      parsed_line = parse_line(line)
      if not parsed_line:
        # Unparseable lines are reported but otherwise dropped.
        tf.logging.warning("Failed to parse line: {}".format(line))
        continue
      log_lines.append(parsed_line)

      # RUN_START / RUN_STOP should each appear exactly once; the asserts
      # guard against duplicated markers in the log.
      if parsed_line.tag == TAGS.RUN_START:
        assert start_time is None
        start_time = float(parsed_line.timestamp)

      if parsed_line.tag == TAGS.RUN_STOP:
        assert stop_time is None
        stop_time = float(parsed_line.timestamp)

      # A concrete (non-"DEFERRED") user count supersedes the placeholder.
      # All EVAL_HP_NUM_USERS entries must agree on the count.
      if (parsed_line.tag == TAGS.EVAL_HP_NUM_USERS and parsed_line.value
          is not None and "DEFERRED" not in parsed_line.value):
        assert num_eval_users is None or num_eval_users == parsed_line.value
        num_eval_users = parsed_line.value
        # Drop the just-appended concrete entry; the count is back-filled
        # into the remaining EVAL_HP_NUM_USERS lines below.
        log_lines.pop()

  # Back-fill the resolved user count into every EVAL_HP_NUM_USERS line
  # (including placeholders emitted before the count was known).
  for i, parsed_line in enumerate(log_lines):
    if parsed_line.tag == TAGS.EVAL_HP_NUM_USERS:
      log_lines[i] = ParsedLine(*parsed_line[:-1], value=num_eval_users)

  # Sorting the re-serialized lines yields timestamp order.
  log_lines = sorted([unparse_line(i) for i in log_lines])

  # Write to STITCHED_COMPLIANCE_FILE if set; otherwise dump to stdout.
  output_path = os.getenv("STITCHED_COMPLIANCE_FILE", None)
  if output_path:
    with tf.gfile.Open(output_path, "w") as f:
      for line in log_lines:
        f.write(line + "\n")
  else:
    for line in log_lines:
      print(line)
    sys.stdout.flush()

  if start_time is not None and stop_time is not None:
    tf.logging.info("MLPerf time: {:.1f} sec.".format(stop_time - start_time))
if __name__ == "__main__":
tf.logging.set_verbosity(tf.logging.INFO)
with LOGGER(True):
......
......@@ -146,6 +146,10 @@ no-space-check=
# else.
single-line-if-stmt=yes
# Allow URLs and comment type annotations to exceed the max line length as neither can be easily
# split across lines.
ignore-long-lines=^\s*(?:(# )?<?https?://\S+>?$|# type:)
[VARIABLES]
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment