Unverified Commit ab6f78d2 authored by Chris Waterson and committed by GitHub

Add utilities and documentation for preparing training data. (#5399)



* add scripts to generate training data

* add scripts to generate training data

* add missing POS tag mapping

* tweak flags to not rely on directory structure, fix issues with output.

* add blurbs for new programs

* oops, need to initialize the embeddings!

* Misspelled 'info'

* editing
Co-authored-by: Chris Waterson <waterson@google.com>
parent bb6f092a
@@ -33,17 +33,16 @@ Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*. The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g. *part of* versus *composed of* versus
   *purpose*), and generally may consist of tens of classes. You can download
   the dataset used in the paper from
   [here](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz).
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use the word embeddings directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
   that connect the constituents for each noun compound. To generate these,
   you'll need a corpus from which to train this data; we used Wikipedia and the
   [LDC GigaWord5](https://catalog.ldc.upenn.edu/LDC2011T07) corpora.

# Contents
@@ -57,51 +56,125 @@ The following source code is included here:

* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

Also included are utilities for preparing data for training:

* `text_embeddings_to_binary.py` converts a text file containing word embeddings
  into a binary file that is quicker to load.
* `extract_paths.py` finds all the dependency paths that connect words in a
  corpus.
* `sorted_paths_to_examples.py` processes the output of `extract_paths.py` to
  produce summarized training data.

This code (in particular, the utilities used to prepare the data) differs from
the code that was used to prepare data for the paper. Notably, we used a
proprietary dependency parser instead of spaCy, which is used here.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site.
* [SciKit Learn](http://scikit-learn.org/): you can probably just install this
  with `pip install sklearn`.
* [SpaCy](https://spacy.io/): `pip install spacy` ought to do the trick, along
  with the English model.

# Creating the Model
This section describes the steps necessary to create and evaluate the model
described in the paper.

## Generate Path Data

To begin, you need three text files:

1. **Corpus**. This file should contain natural language sentences, written with
   one sentence per line. For purposes of exposition, we'll assume that you
   have English Wikipedia serialized this way in `${HOME}/data/wiki.txt`.
2. **Labeled Noun Compound Pairs**. This file contains (modifier, head, label)
   tuples, tab-separated, with one per line. The *label* represents the
   relationship between the head and the modifier; e.g., if `purpose` is one of
   your labels, you might include `tooth<tab>paste<tab>purpose`.
3. **Word Embeddings**. We used the
   [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings; in
   particular, the 6B token, 300d variant. We'll assume you have this file as
   `${HOME}/data/glove.6B.300d.txt`.

We first process the embeddings from their text format into something that we
can load a little more quickly:

    ./text_embeddings_to_binary.py \
      --input ${HOME}/data/glove.6B.300d.txt \
      --output_vocab ${HOME}/data/vocab.txt \
      --output_npy ${HOME}/data/glove.6B.300d.npy
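As a quick sanity check (optional; not part of the pipeline), you can verify
that the vocabulary and the binary matrix line up. This is just a sketch that
assumes the output filenames used above:

    import os

    import numpy as np

    data = os.path.expanduser('~/data')  # the walkthrough's assumed data directory
    embeddings = np.load(os.path.join(data, 'glove.6B.300d.npy'))
    with open(os.path.join(data, 'vocab.txt')) as fh:
      vocab = fh.read().splitlines()

    # One embedding row per vocabulary entry; 300 dimensions for this GloVe variant.
    assert embeddings.shape == (len(vocab), 300), embeddings.shape
    print('%d embeddings of dimension %d' % embeddings.shape)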
Next, we'll extract all the dependency parse paths connecting our labeled pairs
from the corpus. This process takes a *looooong* time, but is trivially
parallelized using map-reduce if you have access to that technology.

    ./extract_paths.py \
      --corpus ${HOME}/data/wiki.txt \
      --labeled_pairs ${HOME}/data/labeled-pairs.tsv \
      --output ${HOME}/data/paths.tsv
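If you don't have a map-reduce stack handy, one workable alternative is to shard
the corpus and run several copies of `extract_paths.py` locally. The following
is only a sketch under the walkthrough's assumed paths; the shard naming and the
upstream `split` invocation are not part of the released tooling:

    #!/usr/bin/env python
    """Sketch: run extract_paths.py over corpus shards in parallel, then merge."""
    import glob
    import os
    import subprocess

    data = os.path.expanduser('~/data')

    # Assumes the corpus was pre-split, e.g.: split -n l/8 wiki.txt wiki.shard.
    shards = sorted(glob.glob(os.path.join(data, 'wiki.shard.*')))

    procs = []
    for i, shard in enumerate(shards):
      out = os.path.join(data, 'paths.%02d.tsv' % i)
      procs.append(subprocess.Popen([
          './extract_paths.py',
          '--corpus', shard,
          '--labeled_pairs', os.path.join(data, 'labeled-pairs.tsv'),
          '--output', out]))

    for proc in procs:
      proc.wait()

    # Concatenate the per-shard outputs into the single paths.tsv used below.
    with open(os.path.join(data, 'paths.tsv'), 'w') as out_fh:
      for i in range(len(shards)):
        with open(os.path.join(data, 'paths.%02d.tsv' % i)) as in_fh:
          out_fh.write(in_fh.read())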
The resulting `paths.tsv` is a tab-separated file that contains the modifier,
the head, the label, the encoded path, and the sentence from which the path was
drawn. (This last is mostly for sanity checking.) A sample row might look
something like this (where the newlines would actually be tab characters):

    navy
    captain
    owner_emp_use
    <X>/PROPN/dobj/>::enter/VERB/ROOT/^::follow/VERB/advcl/<::in/ADP/prep/<::footstep/NOUN/pobj/<::of/ADP/prep/<::father/NOUN/pobj/<::bover/PROPN/appos/<::<Y>/PROPN/compound/<
    He entered the Royal Navy following in the footsteps of his father Captain John Bover and two of his elder brothers as volunteer aboard HMS Perseus
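Each step of the encoded path packs a lemma, a POS tag, a dependency label, and
a traversal direction into a single `/`-separated token, and steps are joined
with `::` (see `get_path` in `extract_paths.py`). A tiny decoder like the
following (the helper name is ours, purely for illustration) can make the paths
easier to eyeball:

    def decode_path(encoded_path):
      """Splits an encoded path into (lemma, pos, dep, direction) tuples."""
      # Assumes lemmas themselves never contain '/'.
      return [tuple(step.split('/')) for step in encoded_path.split('::')]

    # decode_path('<X>/PROPN/dobj/>::enter/VERB/ROOT/^::...')
    # => [('<X>', 'PROPN', 'dobj', '>'), ('enter', 'VERB', 'ROOT', '^'), ...]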
The paths file must be sorted as follows:

    sort -k1,3 -t$'\t' paths.tsv > sorted.paths.tsv

In particular, rows with the same modifier, head, and label must appear
contiguously.

We next create a file that contains all the relation labels from our original
labeled pairs:

    awk 'BEGIN {FS="\t"} {print $3}' < ${HOME}/data/labeled-pairs.tsv \
      | sort -u > ${HOME}/data/relations.txt

With these in hand, we're ready to produce the train, validation, and test data:

    ./sorted_paths_to_examples.py \
      --input ${HOME}/data/sorted.paths.tsv \
      --vocab ${HOME}/data/vocab.txt \
      --relations ${HOME}/data/relations.txt \
      --splits ${HOME}/data/splits.txt \
      --output_dir ${HOME}/data

Here, `splits.txt` is a file that indicates which "split" (train, test, or
validation) you want each pair to appear in. It should be a tab-separated file
which contains the modifier, the head, and the dataset (`train`, `test`, or
`val`) into which the pair should be placed; e.g.:

    tooth <TAB> paste <TAB> train
    banana <TAB> seat <TAB> test
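If you don't already have a canonical split, one way to make a random
`splits.txt` from the labeled pairs is sketched below; the 80/10/10 proportions
and the fixed seed are just assumptions for this walkthrough, not something the
scripts require:

    #!/usr/bin/env python
    """Sketch: randomly assign each labeled (modifier, head) pair to a split."""
    import os
    import random

    random.seed(0)  # keep the assignment reproducible
    data = os.path.expanduser('~/data')

    with open(os.path.join(data, 'labeled-pairs.tsv')) as fh:
      pairs = sorted({tuple(line.split('\t')[:2]) for line in fh.read().splitlines()})

    with open(os.path.join(data, 'splits.txt'), 'w') as out:
      for mod, head in pairs:
        r = random.random()
        split = 'train' if r < 0.8 else ('val' if r < 0.9 else 'test')
        out.write('%s\t%s\t%s\n' % (mod, head, split))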
`sorted_paths_to_examples.py` will produce a separate file for each dataset
split in the directory specified by `--output_dir`. Each file contains
`tf.train.Example` protocol buffers encoded using the `TFRecord` file format.
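To spot-check what was written (optional; this mirrors how
`learn_path_embeddings.py` reads the records), you can iterate over one of the
GZIP-compressed record files and decode an example:

    import os

    import tensorflow as tf

    opts = tf.python_io.TFRecordOptions(
        compression_type=tf.python_io.TFRecordCompressionType.GZIP)

    path = os.path.expanduser('~/data/train.tfrecs.gz')  # as produced above
    for record in tf.python_io.tf_record_iterator(path, opts):
      example = tf.train.Example.FromString(record)
      print(example.features.feature['pair'].bytes_list.value[0],
            example.features.feature['rel'].bytes_list.value[0])
      break  # just peek at the first record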
## Create Path Embeddings

Now we're ready to train the path embeddings using `learn_path_embeddings.py`:

    ./learn_path_embeddings.py \
      --train ${HOME}/data/train.tfrecs.gz \
      --val ${HOME}/data/val.tfrecs.gz \
      --test ${HOME}/data/test.tfrecs.gz \
      --embeddings ${HOME}/data/glove.6B.300d.npy \
      --relations ${HOME}/data/relations.txt \
      --output_dir ${HOME}/data/path-embeddings \
      --logdir /tmp/learn_path_embeddings

The path embeddings will be placed in the directory specified by `--output_dir`.
## Train classifiers
...
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Extracts the dependency paths that connect labeled (modifier, head) pairs."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import itertools
import sys
import spacy
import tensorflow as tf
tf.flags.DEFINE_string('corpus', '', 'Filename of corpus')
tf.flags.DEFINE_string('labeled_pairs', '', 'Filename of labeled pairs')
tf.flags.DEFINE_string('output', '', 'Filename of output file')
FLAGS = tf.flags.FLAGS
def get_path(mod_token, head_token):
"""Returns the path between a modifier token and a head token."""
# Compute the path from the root to each token.
mod_ancestors = list(reversed(list(mod_token.ancestors)))
head_ancestors = list(reversed(list(head_token.ancestors)))
# If the paths don't start at the same place (odd!) then there is no path at
# all.
if (not mod_ancestors or not head_ancestors
or mod_ancestors[0] != head_ancestors[0]):
return None
# Eject elements from the common path until we reach the first differing
# ancestor.
ix = 1
while (ix < len(mod_ancestors) and ix < len(head_ancestors)
and mod_ancestors[ix] == head_ancestors[ix]):
ix += 1
# Construct the path. TODO: add "satellites", possibly honor sentence
# ordering between modifier and head rather than just always traversing from
# the modifier to the head?
path = ['/'.join(('<X>', mod_token.pos_, mod_token.dep_, '>'))]
path += ['/'.join((tok.lemma_, tok.pos_, tok.dep_, '>'))
for tok in reversed(mod_ancestors[ix:])]
root_token = mod_ancestors[ix - 1]
path += ['/'.join((root_token.lemma_, root_token.pos_, root_token.dep_, '^'))]
path += ['/'.join((tok.lemma_, tok.pos_, tok.dep_, '<'))
for tok in head_ancestors[ix:]]
path += ['/'.join(('<Y>', head_token.pos_, head_token.dep_, '<'))]
return '::'.join(path)
def main(_):
nlp = spacy.load('en_core_web_sm')
# Grab the set of labeled pairs for which we wish to collect paths.
with tf.gfile.GFile(FLAGS.labeled_pairs) as fh:
parts = (l.decode('utf-8').split('\t') for l in fh.read().splitlines())
labeled_pairs = {(mod, head): rel for mod, head, rel in parts}
# Create a mapping from each head to the modifiers that are used with it.
mods_for_head = {
head: set(hm[1] for hm in head_mods)
for head, head_mods in itertools.groupby(
sorted((head, mod) for (mod, head) in labeled_pairs.iterkeys()),
lambda (head, mod): head)}
# Collect all the heads that we know about.
heads = set(mods_for_head.keys())
# For each sentence that contains a (head, modifier) pair that's in our set,
# emit the dependency path that connects the pair.
out_fh = sys.stdout if not FLAGS.output else tf.gfile.GFile(FLAGS.output, 'w')
in_fh = sys.stdin if not FLAGS.corpus else tf.gfile.GFile(FLAGS.corpus)
num_paths = 0
for line, sen in enumerate(in_fh, start=1):
if line % 100 == 0:
print('\rProcessing line %d: %d paths' % (line, num_paths),
end='', file=sys.stderr)
sen = sen.decode('utf-8').strip()
doc = nlp(sen)
for head_token in doc:
head_text = head_token.text.lower()
if head_text in heads:
mods = mods_for_head[head_text]
for mod_token in doc:
mod_text = mod_token.text.lower()
if mod_text in mods:
path = get_path(mod_token, head_token)
if path:
label = labeled_pairs[(mod_text, head_text)]
line = '\t'.join((mod_text, head_text, label, path, sen))
print(line.encode('utf-8'), file=out_fh)
num_paths += 1
out_fh.close()
if __name__ == '__main__':
tf.app.run()
@@ -26,29 +26,13 @@ import path_model
from sklearn import metrics
import tensorflow as tf

tf.flags.DEFINE_string('train', '', 'training dataset, tfrecs')
tf.flags.DEFINE_string('val', '', 'validation dataset, tfrecs')
tf.flags.DEFINE_string('test', '', 'test dataset, tfrecs')
tf.flags.DEFINE_string('embeddings', '', 'embeddings, npy')
tf.flags.DEFINE_string('relations', '', 'file containing relation labels')
tf.flags.DEFINE_string('output_dir', '', 'output directory for path embeddings')
tf.flags.DEFINE_string('logdir', '', 'directory for model training')

FLAGS = tf.flags.FLAGS
@@ -56,37 +40,26 @@ def main(_):
  # Pick up any one-off hyper-parameters.
  hparams = path_model.PathBasedModel.default_hparams()

  with open(FLAGS.relations) as fh:
    relations = fh.read().splitlines()

  hparams.num_classes = len(relations)
  print('Model will predict into %d classes' % hparams.num_classes)

  print('Running with hyper-parameters: {}'.format(hparams))

  # Load the instances
  print('Loading instances...')
  opts = tf.python_io.TFRecordOptions(
      compression_type=tf.python_io.TFRecordCompressionType.GZIP)

  train_instances = list(tf.python_io.tf_record_iterator(FLAGS.train, opts))
  val_instances = list(tf.python_io.tf_record_iterator(FLAGS.val, opts))
  test_instances = list(tf.python_io.tf_record_iterator(FLAGS.test, opts))

  # Load the word embeddings
  print('Loading word embeddings...')
  lemma_embeddings = lexnet_common.load_word_embeddings(FLAGS.embeddings)

  # Define the graph and the model
  with tf.Graph().as_default():
@@ -95,7 +68,7 @@ def main(_):
        compression_type=tf.python_io.TFRecordCompressionType.GZIP)
    reader = tf.TFRecordReader(options=options)
    _, train_instance = reader.read(
        tf.train.string_input_producer([FLAGS.train]))
    shuffled_train_instance = tf.train.shuffle_batch(
        [train_instance],
        batch_size=1,
@@ -113,17 +86,13 @@ def main(_):
        hparams, lemma_embeddings, val_instance)

    # Initialize a session and start training
    best_model_saver = tf.train.Saver()
    f1_t = tf.placeholder(tf.float32)
    best_f1_t = tf.Variable(0.0, trainable=False, name='best_f1')
    assign_best_f1_op = tf.assign(best_f1_t, f1_t)

    supervisor = tf.train.Supervisor(
        logdir=FLAGS.logdir,
        global_step=train_model.global_step)

    with supervisor.managed_session() as session:
@@ -131,11 +100,6 @@ def main(_):
      print('Loading labels...')
      val_labels = train_model.load_labels(session, val_instances)

      # Train the model
      print('Training the model...')
@@ -152,13 +116,13 @@ def main(_):
        best_f1 = session.run(best_f1_t)
        f1 = epoch_completed(val_model, session, epoch, epoch_loss,
                             val_instances, val_labels, best_model_saver,
                             FLAGS.logdir, best_f1)

        if f1 > best_f1:
          session.run(assign_best_f1_op, {f1_t: f1})

        if f1 < best_f1 - 0.08:
          tf.logging.info('Stopping training after %d epochs.\n' % epoch)
          break

      # Print the best performance on the validation set
@@ -170,16 +134,12 @@ def main(_):
      instances = train_instances + val_instances + test_instances
      path_index, path_vectors = path_model.compute_path_embeddings(
          val_model, session, instances)

      if not os.path.exists(FLAGS.output_dir):
        os.makedirs(FLAGS.output_dir)

      path_model.save_path_embeddings(
          val_model, path_vectors, path_index, FLAGS.output_dir)


def epoch_completed(model, session, epoch, epoch_loss,
@@ -214,9 +174,10 @@ def epoch_completed(model, session, epoch, epoch_loss,
                        precision, recall, f1))

  if f1 > best_f1:
    save_filename = os.path.join(save_path, 'best.ckpt')
    print('Saving model in: %s' % save_filename)
    saver.save(session, save_filename)
    print('Model saved in file: %s' % save_filename)

  return f1
...
@@ -55,30 +55,18 @@ DIRS = '_^V<>'
DIR_TO_ID = {dir: did for did, dir in enumerate(DIRS)}


def load_word_embeddings(embedding_filename):
  """Loads pretrained word embeddings from a binary file and returns the matrix.

  Adds the <PAD>, <UNK>, <X>, and <Y> tokens to the beginning of the vocab.

  Args:
    embedding_filename: filename of the binary NPY data

  Returns:
    The word embeddings matrix
  """
  embeddings = np.load(embedding_filename)
  dim = embeddings.shape[1]

  # Four initially random vectors for the special tokens: <PAD>, <UNK>, <X>, <Y>
...
@@ -330,7 +330,7 @@ def _parse_tensorflow_example(record, max_path_len, input_keep_prob):
      lambda: tf.zeros([1, max_path_len, 4], dtype=tf.int64))

  # Paths are left-padded. We reverse them to make them right-padded.
  #paths = tf.reverse(paths, axis=[1])

  path_counts = tf.cond(
      tf.shape(path_counts)[0] > 0,
...
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Takes as input a sorted, tab-separated of paths to produce tf.Examples."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import itertools
import os
import sys
import tensorflow as tf
import lexnet_common
tf.flags.DEFINE_string('input', '', 'tab-separated input data')
tf.flags.DEFINE_string('vocab', '', 'a text file containing lemma vocabulary')
tf.flags.DEFINE_string('relations', '', 'a text file containing the relations')
tf.flags.DEFINE_string('output_dir', '', 'output directory')
tf.flags.DEFINE_string('splits', '', 'text file enumerating splits')
tf.flags.DEFINE_string('default_split', '', 'default split for unlabeled pairs')
tf.flags.DEFINE_string('compression', 'GZIP', 'compression for output records')
tf.flags.DEFINE_integer('max_paths', 100, 'maximum number of paths per record')
tf.flags.DEFINE_integer('max_pathlen', 8, 'maximum path length')
FLAGS = tf.flags.FLAGS
def _int64_features(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _bytes_features(value):
value = [v.encode('utf-8') if isinstance(v, unicode) else v for v in value]
return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))
class CreateExampleFn(object):
def __init__(self):
# Read the vocabulary. N.B. that 0 = PAD, 1 = UNK, 2 = <X>, 3 = <Y>, hence
# the enumeration starting at 4.
with tf.gfile.GFile(FLAGS.vocab) as fh:
self.vocab = {w: ix for ix, w in enumerate(fh.read().splitlines(), start=4)}
self.vocab.update({'<PAD>': 0, '<UNK>': 1, '<X>': 2, '<Y>': 3})
# Read the relations.
with tf.gfile.GFile(FLAGS.relations) as fh:
self.relations = {r: ix for ix, r in enumerate(fh.read().splitlines())}
# Some hackery to map from SpaCy postags to Google's.
lexnet_common.POSTAG_TO_ID['PROPN'] = lexnet_common.POSTAG_TO_ID['NOUN']
lexnet_common.POSTAG_TO_ID['PRON'] = lexnet_common.POSTAG_TO_ID['NOUN']
lexnet_common.POSTAG_TO_ID['CCONJ'] = lexnet_common.POSTAG_TO_ID['CONJ']
#lexnet_common.DEPLABEL_TO_ID['relcl'] = lexnet_common.DEPLABEL_TO_ID['rel']
#lexnet_common.DEPLABEL_TO_ID['compound'] = lexnet_common.DEPLABEL_TO_ID['xcomp']
#lexnet_common.DEPLABEL_TO_ID['oprd'] = lexnet_common.DEPLABEL_TO_ID['UNK']
def __call__(self, mod, head, rel, raw_paths):
# Drop any really long paths.
paths = []
counts = []
for raw, count in raw_paths.most_common(FLAGS.max_paths):
path = raw.split('::')
if len(path) <= FLAGS.max_pathlen:
paths.append(path)
counts.append(count)
if not paths:
return None
# Compute the true length.
pathlens = [len(path) for path in paths]
# Pad each path out to max_pathlen so the LSTM can eat it.
paths = (
itertools.islice(
itertools.chain(path, itertools.repeat('<PAD>/PAD/PAD/_')),
FLAGS.max_pathlen)
for path in paths)
# Split the lemma, POS, dependency label, and direction each into a
# separate feature.
lemmas, postags, deplabels, dirs = zip(
*(part.split('/') for part in itertools.chain(*paths)))
lemmas = [self.vocab.get(lemma, 1) for lemma in lemmas]
postags = [lexnet_common.POSTAG_TO_ID[pos] for pos in postags]
deplabels = [lexnet_common.DEPLABEL_TO_ID.get(dep, 1) for dep in deplabels]
dirs = [lexnet_common.DIR_TO_ID.get(d, 0) for d in dirs]
return tf.train.Example(features=tf.train.Features(feature={
'pair': _bytes_features(['::'.join((mod, head))]),
'rel': _bytes_features([rel]),
'rel_id': _int64_features([self.relations[rel]]),
'reprs': _bytes_features(raw_paths),
'pathlens': _int64_features(pathlens),
'counts': _int64_features(counts),
'lemmas': _int64_features(lemmas),
'dirs': _int64_features(dirs),
'deplabels': _int64_features(deplabels),
'postags': _int64_features(postags),
'x_embedding_id': _int64_features([self.vocab[mod]]),
'y_embedding_id': _int64_features([self.vocab[head]]),
}))
def main(_):
# Read the splits file, if there is one.
assignments = {}
splits = set()
if FLAGS.splits:
with tf.gfile.GFile(FLAGS.splits) as fh:
parts = (line.split('\t') for line in fh.read().splitlines())
assignments = {(mod, head): split for mod, head, split in parts}
splits = set(assignments.itervalues())
if FLAGS.default_split:
default_split = FLAGS.default_split
splits.add(FLAGS.default_split)
elif splits:
default_split = iter(splits).next()
else:
print('Please specify --splits, --default_split, or both', file=sys.stderr)
return 1
last_mod, last_head, last_label = None, None, None
raw_paths = collections.Counter()
# Keep track of pairs we've seen to ensure that we don't get unsorted data.
seen_labeled_pairs = set()
# Set up output compression
compression_type = getattr(
tf.python_io.TFRecordCompressionType, FLAGS.compression)
options = tf.python_io.TFRecordOptions(compression_type=compression_type)
writers = {
split: tf.python_io.TFRecordWriter(
os.path.join(FLAGS.output_dir, '%s.tfrecs.gz' % split),
options=options)
for split in splits}
create_example = CreateExampleFn()
in_fh = sys.stdin if not FLAGS.input else tf.gfile.GFile(FLAGS.input)
for lineno, line in enumerate(in_fh, start=1):
if lineno % 100 == 0:
print('\rProcessed %d lines...' % lineno, end='', file=sys.stderr)
parts = line.decode('utf-8').strip().split('\t')
if len(parts) != 5:
print('Skipping line %d: %d columns (expected 5)' % (
lineno, len(parts)), file=sys.stderr)
continue
mod, head, label, raw_path, source = parts
if mod == last_mod and head == last_head and label == last_label:
raw_paths.update([raw_path])
continue
if last_mod and last_head and last_label and raw_paths:
if (last_mod, last_head, last_label) in seen_labeled_pairs:
print('It looks like the input data is not sorted; ignoring extra '
'record for (%s::%s, %s) at line %d' % (
last_mod, last_head, last_label, lineno))
else:
ex = create_example(last_mod, last_head, last_label, raw_paths)
if ex:
split = assignments.get((last_mod, last_head), default_split)
writers[split].write(ex.SerializeToString())
seen_labeled_pairs.add((last_mod, last_head, last_label))
last_mod, last_head, last_label = mod, head, label
raw_paths = collections.Counter()
if last_mod and last_head and last_label and raw_paths:
ex = create_example(last_mod, last_head, last_label, raw_paths)
if ex:
split = assignments.get((last_mod, last_head), default_split)
writers[split].write(ex.SerializeToString())
for writer in writers.itervalues():
writer.close()
if __name__ == '__main__':
tf.app.run()
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts a text embedding file into a binary format for quicker loading."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
tf.flags.DEFINE_string('input', '', 'text file containing embeddings')
tf.flags.DEFINE_string('output_vocab', '', 'output file for vocabulary')
tf.flags.DEFINE_string('output_npy', '', 'output file for binary')
FLAGS = tf.flags.FLAGS
def main(_):
vecs = []
vocab = []
with tf.gfile.GFile(FLAGS.input) as fh:
for line in fh:
parts = line.strip().split()
vocab.append(parts[0])
vecs.append([float(x) for x in parts[1:]])
with tf.gfile.GFile(FLAGS.output_vocab, 'w') as fh:
fh.write('\n'.join(vocab))
fh.write('\n')
vecs = np.array(vecs, dtype=np.float32)
np.save(FLAGS.output_npy, vecs, allow_pickle=False)
if __name__ == '__main__':
tf.app.run()