Unverified commit deb772d5 authored by Lukasz Kaiser, committed by GitHub

Merge pull request #3683 from waterson/master

Add LexNet noun compounds model to models repository.
parents 2b775720 1f371534
@@ -17,6 +17,7 @@
/research/inception/ @shlens @vincentvanhoucke
/research/learned_optimizer/ @olganw @nirum
/research/learning_to_remember_rare_events/ @lukaszkaiser @ofirnachum
/research/lexnet_nc/ @vered1986 @waterson
/research/lfads/ @jazcollins @susillo
/research/lm_1b/ @oriolvinyals @panyx0718
/research/maskgan/ @a-dai
...
@@ -36,6 +36,8 @@ installation](https://www.tensorflow.org/install).
- [inception](inception): deep convolutional networks for computer vision.
- [learning_to_remember_rare_events](learning_to_remember_rare_events): a
large-scale life-long memory module for use in deep learning.
- [lexnet_nc](lexnet_nc): a distributed model for noun compound relationship
classification.
- [lfads](lfads): sequential variational autoencoder for analyzing
neuroscience data.
- [lm_1b](lm_1b): language modeling on the one billion word benchmark.
...
# LexNET for Noun Compound Relation Classification
This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for relation classification, applied here to the relationships that
hold between the constituents of noun compounds:
* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor
The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:
1. A neural "paraphrase" of each syntactic dependency path that connects the
constituents in a large corpus. For example, given a sentence like *This fine
oil is made from first-press olives*, the dependency path is something like
`oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words; i.e., the
word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
embedding of the noun compound in context.
The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2). The *distributional-nc model* and the *integrated-nc* model each add (3).
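
To make the variants concrete, here is a minimal NumPy sketch (not the model's
actual API; the dimensions are the defaults from `lexnet_model.py`, and all
values are random placeholders) of how the *integrated-nc* variant concatenates
its inputs and feeds them through a one-hidden-layer MLP:

    import numpy as np

    relata_dim, path_dim, num_classes = 300, 60, 37  # default hyper-parameters

    x_emb = np.random.rand(relata_dim)       # (2) embedding of the first constituent
    y_emb = np.random.rand(relata_dim)       # (2) embedding of the second constituent
    avg_path_emb = np.random.rand(path_dim)  # (1) count-weighted average path vector
    nc_emb = np.random.rand(relata_dim)      # (3) embedding of the compound itself

    # The network input is the concatenation of whichever components the
    # chosen variant uses.
    network_input = np.concatenate([x_emb, y_emb, avg_path_emb, nc_emb])

    # One hidden layer (the default), then a linear layer over the relations.
    hidden_dim = network_input.size // 2
    W1 = np.random.rand(network_input.size, hidden_dim)
    W2 = np.random.rand(hidden_dim, num_classes)
    scores = np.dot(np.tanh(np.dot(network_input, W1)), W2)
    prediction = int(np.argmax(scores))
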
Training a model requires the following:
1. A collection of noun compounds that have been labeled using a *relation
inventory*. The inventory describes the specific relationships that you'd
like the model to differentiate (e.g. *part of* versus *composed of* versus
*purpose*), and generally may consist of tens of classes; see the sketch below.
2. A collection of word embeddings: the path-based model uses the
word embeddings as part of the path representation, and the distributional
models use the word embeddings directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
that connect the constituents for each noun compound.
At the moment, this repository does not contain the tools for generating this
data, but we will provide references to existing datasets and plan to add tools
to generate the data in the future.
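
For reference, the relation inventory from item (1) is expected as a plain-text
`classes.txt` file with one relation name per line under the dataset directory;
the following sketch mirrors how `learn_path_embeddings.py` and
`learn_classifier.py` load it (directory names are just the script defaults):

    import os

    dataset_dir, dataset = 'datasets', 'tratz/fine_grained'  # default flag values
    classes_filename = os.path.join(dataset_dir, dataset, 'classes.txt')
    with open(classes_filename) as f_in:
      classes = f_in.read().splitlines()  # one relation name per line
    num_classes = len(classes)            # becomes hparams.num_classes
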
# Contents
The following source code is included here:
* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
model to predict a noun-compound relationship given labeled noun-compounds and
dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
syntactic dependency paths for a particular relationship (see the sketch below).
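
As a rough illustration of the scoring idea behind `get_indicative_paths.py`:
each path embedding is classified on its own with the trained path weights, and
the highest-confidence paths per relation are kept. The NumPy sketch below is a
simplification (the script runs this inside a TensorFlow session and also falls
back to the top-k paths when few exceed the threshold):

    import numpy as np

    num_paths, path_dim, num_classes = 1000, 60, 37      # illustrative sizes
    path_vectors = np.random.rand(num_paths, path_dim)   # learned path embeddings
    W1 = np.random.rand(path_dim, num_classes)           # trained classifier weights

    logits = np.dot(path_vectors, W1)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    predictions = probs.argmax(axis=1)  # most likely relation for each path
    confidences = probs.max(axis=1)

    threshold = 0.8
    for relation in range(num_classes):
      indicative = np.where((predictions == relation) & (confidences >= threshold))[0]
      # ...the script then writes the corresponding path strings to <relation>.paths.
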
# Dependencies
* [TensorFlow](http://www.tensorflow.org/): see detailed installation
instructions at that site.
* [SciKit Learn](http://scikit-learn.org/): you can probably just install this
with `pip install sklearn`.
# Creating the Model
This section describes the necessary steps that you must follow to reproduce the
results in the paper.
## Generate/Download Path Data
TBD! Our plan is to make available the aggregate path data that was used to
train the path embeddings and classifiers; however, this will be released
separately.
## Generate/Download Embedding Data
TBD! While we used the standard GloVe vectors for the relata embeddings, the NC
embeddings were generated separately. Our plan is to make that data available,
but it will be released separately.
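
Until then, note that `lexnet_common.load_word_embeddings()` expects a
NumPy-saved embedding matrix with a `vocab.txt` file in the same directory, and
it prepends four random vectors for the special tokens `<PAD>`, `<UNK>`, `<X>`
and `<Y>`. A minimal sketch of that layout, using the default file names from
the hyper-parameters:

    import os
    import numpy as np

    embeddings_base_path = 'embeddings'                 # default flag value
    relata_embeddings_file = 'glove/glove.6B.300d.bin'  # default hparam (an np.save'd matrix)

    embedding_file = os.path.join(embeddings_base_path, relata_embeddings_file)
    vocab_file = os.path.join(os.path.dirname(embedding_file), 'vocab.txt')

    with open(vocab_file) as f_in:
      vocab = [line.strip() for line in f_in]  # row i + 4 of the final matrix is vocab[i]
    embeddings = np.load(embedding_file)

    # Four random rows are prepended for <PAD>, <UNK>, <X>, <Y>.
    special = np.random.normal(0, 0.1, (4, embeddings.shape[1]))
    embeddings = np.vstack((special, embeddings)).astype(np.float32)
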
## Create Path Embeddings
Create the path embeddings using `learn_path_embeddings.py`. This shell script
fragment will iterate through each dataset, split, and corpus to generate path
embeddings for each.

    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigawords ; do
          python learn_path_embeddings.py \
            --dataset_dir ~/lexnet/datasets \
            --dataset "${DATASET}" \
            --corpus "${SPLIT}/${CORPUS}" \
            --embeddings_base_path ~/lexnet/embeddings \
            --logdir /tmp/learn_path_embeddings
        done
      done
    done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.
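
They can be read back with `path_model.load_path_embeddings()`, the same way
`get_indicative_paths.py` does; a small sketch using the default dimensions
(lemma 50 + POS 4 + dependency 5 + direction 1 = 60):

    import os
    import path_model

    embeddings_base_path = os.path.expanduser('~/lexnet/embeddings')
    dataset, corpus = 'tratz/fine_grained', 'random/wiki_gigawords'
    path_dim = 50 + 4 + 5 + 1  # lemma_dim + pos_dim + dep_dim + dir_dim

    path_embeddings, path_to_index = path_model.load_path_embeddings(
        os.path.join(embeddings_base_path, 'path_embeddings', dataset, corpus),
        path_dim)
    print('Loaded %d path embeddings' % len(path_to_index))
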
## Train Classifiers
Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.

    LOGDIR=/tmp/learn_classifier
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.
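
For reference, the reported precision, recall, and F1 are the weighted averages
computed with scikit-learn in `lexnet_common.full_evaluation()`; a toy example
(the label values here are made up):

    from sklearn import metrics

    gold = [0, 1, 2, 1, 0]  # gold-standard relation IDs
    pred = [0, 1, 1, 1, 0]  # predicted relation IDs
    precision, recall, f1, _ = metrics.precision_recall_fscore_support(
        gold, pred, average='weighted')
    print('P: %.3f, R: %.3f, F1: %.3f' % (precision, recall, f1))
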
# Contact
If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Extracts paths that are indicative of each relation."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import tensorflow as tf
from . import path_model
from . import lexnet_common
tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')
tf.flags.DEFINE_string(
'dataset',
'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')
tf.flags.DEFINE_string(
'corpus', 'random/wiki',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')
tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')
tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')
tf.flags.DEFINE_integer(
'top_k', 20, 'Number of top paths to extract')
tf.flags.DEFINE_float(
'threshold', 0.8, 'Threshold above which to consider paths as indicative')
FLAGS = tf.flags.FLAGS
def main(_):
hparams = path_model.PathBasedModel.default_hparams()
# First things first. Load the path data.
path_embeddings_file = 'path_embeddings/{dataset}/{corpus}'.format(
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
path_dim = (hparams.lemma_dim + hparams.pos_dim +
hparams.dep_dim + hparams.dir_dim)
path_embeddings, path_to_index = path_model.load_path_embeddings(
os.path.join(FLAGS.embeddings_base_path, path_embeddings_file),
path_dim)
# Load and count the classes so we can correctly instantiate the model.
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
with open(classes_filename) as f_in:
classes = f_in.read().splitlines()
hparams.num_classes = len(classes)
# We need the word embeddings to instantiate the model, too.
print('Loading word embeddings...')
lemma_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)
# Instantiate the model.
with tf.Graph().as_default():
with tf.variable_scope('lexnet'):
instance = tf.placeholder(dtype=tf.string)
model = path_model.PathBasedModel(
hparams, lemma_embeddings, instance)
with tf.Session() as session:
model_dir = '{logdir}/results/{dataset}/path/{corpus}'.format(
logdir=FLAGS.logdir,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
saver = tf.train.Saver()
saver.restore(session, os.path.join(model_dir, 'best.ckpt'))
path_model.get_indicative_paths(
model, session, path_to_index, path_embeddings, classes,
model_dir, FLAGS.top_k, FLAGS.threshold)
if __name__ == '__main__':
tf.app.run()
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Trains the integrated LexNET classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import lexnet_common
import lexnet_model
import path_model
from sklearn import metrics
import tensorflow as tf
tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')
tf.flags.DEFINE_string(
'dataset', 'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')
tf.flags.DEFINE_string(
'corpus', 'wiki/random',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')
tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')
tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')
tf.flags.DEFINE_string('hparams', '', 'Hyper-parameters')
tf.flags.DEFINE_string(
'input', 'integrated',
    'The model (dist/dist-nc/path/integrated/integrated-nc)')
FLAGS = tf.flags.FLAGS
def main(_):
# Pick up any one-off hyper-parameters.
hparams = lexnet_model.LexNETModel.default_hparams()
hparams.corpus = FLAGS.corpus
hparams.input = FLAGS.input
hparams.path_embeddings_file = 'path_embeddings/%s/%s' % (
FLAGS.dataset, FLAGS.corpus)
input_dir = hparams.input if hparams.input != 'path' else 'path_classifier'
# Set the number of classes
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
with open(classes_filename) as f_in:
classes = f_in.read().splitlines()
hparams.num_classes = len(classes)
print('Model will predict into %d classes' % hparams.num_classes)
# Get the datasets
train_set, val_set, test_set = (
os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, FLAGS.corpus,
filename + '.tfrecs.gz')
for filename in ['train', 'val', 'test'])
print('Running with hyper-parameters: {}'.format(hparams))
# Load the instances
print('Loading instances...')
opts = tf.python_io.TFRecordOptions(
compression_type=tf.python_io.TFRecordCompressionType.GZIP)
train_instances = list(tf.python_io.tf_record_iterator(train_set, opts))
val_instances = list(tf.python_io.tf_record_iterator(val_set, opts))
test_instances = list(tf.python_io.tf_record_iterator(test_set, opts))
# Load the word embeddings
print('Loading word embeddings...')
relata_embeddings, path_embeddings, nc_embeddings, path_to_index = (
None, None, None, None)
if hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
relata_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.relata_embeddings_file)
if hparams.input in ['path', 'integrated', 'integrated-nc']:
path_embeddings, path_to_index = path_model.load_path_embeddings(
os.path.join(FLAGS.embeddings_base_path, hparams.path_embeddings_file),
hparams.path_dim)
if hparams.input in ['dist-nc', 'integrated-nc']:
nc_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.nc_embeddings_file)
# Define the graph and the model
with tf.Graph().as_default():
model = lexnet_model.LexNETModel(
hparams, relata_embeddings, path_embeddings,
nc_embeddings, path_to_index)
# Initialize a session and start training
session = tf.Session()
session.run(tf.global_variables_initializer())
  # Initialize the path mapping
if hparams.input in ['path', 'integrated', 'integrated-nc']:
session.run(tf.tables_initializer())
session.run(model.initialize_path_op, {
model.path_initial_value_t: path_embeddings
})
# Initialize the NC embeddings
if hparams.input in ['dist-nc', 'integrated-nc']:
session.run(model.initialize_nc_op, {
model.nc_initial_value_t: nc_embeddings
})
# Load the labels
print('Loading labels...')
train_labels = model.load_labels(session, train_instances)
val_labels = model.load_labels(session, val_instances)
test_labels = model.load_labels(session, test_instances)
save_path = '{logdir}/results/{dataset}/{input}/{corpus}'.format(
logdir=FLAGS.logdir, dataset=FLAGS.dataset,
corpus=model.hparams.corpus, input=input_dir)
if not os.path.exists(save_path):
os.makedirs(save_path)
# Train the model
print('Training the model...')
model.fit(session, train_instances, epoch_completed,
val_instances, val_labels, save_path)
# Print the best performance on the validation set
print('Best performance on the validation set: F1=%.3f' %
epoch_completed.best_f1)
# Evaluate on the train and validation sets
lexnet_common.full_evaluation(model, session, train_instances, train_labels,
'Train', classes)
lexnet_common.full_evaluation(model, session, val_instances, val_labels,
'Validation', classes)
test_predictions = lexnet_common.full_evaluation(
model, session, test_instances, test_labels, 'Test', classes)
# Write the test predictions to a file
predictions_file = os.path.join(save_path, 'test_predictions.tsv')
  print('Saving test predictions to %s' % predictions_file)
test_pairs = model.load_pairs(session, test_instances)
lexnet_common.write_predictions(test_pairs, test_labels, test_predictions,
classes, predictions_file)
def epoch_completed(model, session, epoch, epoch_loss,
val_instances, val_labels, save_path):
"""Runs every time an epoch completes.
  Print the performance on the validation set, and update the saved model if
  its performance is better than the previous best. If the performance dropped
  significantly, tell the training to stop.
Args:
model: The currently trained path-based model.
session: The current TensorFlow session.
epoch: The epoch number.
epoch_loss: The current epoch loss.
val_instances: The validation set instances (evaluation between epochs).
val_labels: The validation set labels (for evaluation between epochs).
save_path: Where to save the model.
Returns:
whether the training should stop.
"""
stop_training = False
# Evaluate on the validation set
val_pred = model.predict(session, val_instances)
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
val_labels, val_pred, average='weighted')
print(
'Epoch: %d/%d, Loss: %f, validation set: P: %.3f, R: %.3f, F1: %.3f\n' % (
epoch + 1, model.hparams.num_epochs, epoch_loss,
precision, recall, f1))
# If the F1 is much smaller than the previous one, stop training. Else, if
# it's bigger, save the model.
if f1 < epoch_completed.best_f1 - 0.08:
stop_training = True
if f1 > epoch_completed.best_f1:
saver = tf.train.Saver()
checkpoint_filename = os.path.join(save_path, 'best.ckpt')
print('Saving model in: %s' % checkpoint_filename)
saver.save(session, checkpoint_filename)
print('Model saved in file: %s' % checkpoint_filename)
epoch_completed.best_f1 = f1
return stop_training
epoch_completed.best_f1 = 0
if __name__ == '__main__':
tf.app.run(main)
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Trains the LexNET path-based model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import lexnet_common
import path_model
from sklearn import metrics
import tensorflow as tf
tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')
tf.flags.DEFINE_string(
'dataset',
'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')
tf.flags.DEFINE_string(
'corpus', 'random/wiki_gigawords',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')
tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')
tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')
FLAGS = tf.flags.FLAGS
def main(_):
# Pick up any one-off hyper-parameters.
hparams = path_model.PathBasedModel.default_hparams()
# Set the number of classes
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
with open(classes_filename) as f_in:
classes = f_in.read().splitlines()
hparams.num_classes = len(classes)
print('Model will predict into %d classes' % hparams.num_classes)
# Get the datasets
train_set, val_set, test_set = (
os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, FLAGS.corpus,
filename + '.tfrecs.gz')
for filename in ['train', 'val', 'test'])
print('Running with hyper-parameters: {}'.format(hparams))
# Load the instances
print('Loading instances...')
opts = tf.python_io.TFRecordOptions(
compression_type=tf.python_io.TFRecordCompressionType.GZIP)
train_instances = list(tf.python_io.tf_record_iterator(train_set, opts))
val_instances = list(tf.python_io.tf_record_iterator(val_set, opts))
test_instances = list(tf.python_io.tf_record_iterator(test_set, opts))
# Load the word embeddings
print('Loading word embeddings...')
lemma_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)
# Define the graph and the model
with tf.Graph().as_default():
with tf.variable_scope('lexnet'):
options = tf.python_io.TFRecordOptions(
compression_type=tf.python_io.TFRecordCompressionType.GZIP)
reader = tf.TFRecordReader(options=options)
_, train_instance = reader.read(
tf.train.string_input_producer([train_set]))
shuffled_train_instance = tf.train.shuffle_batch(
[train_instance],
batch_size=1,
num_threads=1,
capacity=len(train_instances),
min_after_dequeue=100,
)[0]
train_model = path_model.PathBasedModel(
hparams, lemma_embeddings, shuffled_train_instance)
with tf.variable_scope('lexnet', reuse=True):
val_instance = tf.placeholder(dtype=tf.string)
val_model = path_model.PathBasedModel(
hparams, lemma_embeddings, val_instance)
# Initialize a session and start training
logdir = (
'{logdir}/results/{dataset}/path/{corpus}/supervisor.logdir'.format(
logdir=FLAGS.logdir, dataset=FLAGS.dataset, corpus=FLAGS.corpus))
best_model_saver = tf.train.Saver()
f1_t = tf.placeholder(tf.float32)
best_f1_t = tf.Variable(0.0, trainable=False, name='best_f1')
assign_best_f1_op = tf.assign(best_f1_t, f1_t)
supervisor = tf.train.Supervisor(
logdir=logdir,
global_step=train_model.global_step)
with supervisor.managed_session() as session:
# Load the labels
print('Loading labels...')
val_labels = train_model.load_labels(session, val_instances)
save_path = '{logdir}/results/{dataset}/path/{corpus}/'.format(
logdir=FLAGS.logdir,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
# Train the model
print('Training the model...')
while True:
step = session.run(train_model.global_step)
epoch = (step + len(train_instances) - 1) // len(train_instances)
if epoch > hparams.num_epochs:
break
print('Starting epoch %d (step %d)...' % (1 + epoch, step))
epoch_loss = train_model.run_one_epoch(session, len(train_instances))
best_f1 = session.run(best_f1_t)
f1 = epoch_completed(val_model, session, epoch, epoch_loss,
val_instances, val_labels, best_model_saver,
save_path, best_f1)
if f1 > best_f1:
session.run(assign_best_f1_op, {f1_t: f1})
if f1 < best_f1 - 0.08:
        tf.logging.info('Stopping training after %d epochs.\n' % epoch)
break
# Print the best performance on the validation set
best_f1 = session.run(best_f1_t)
print('Best performance on the validation set: F1=%.3f' % best_f1)
# Save the path embeddings
print('Computing the path embeddings...')
instances = train_instances + val_instances + test_instances
path_index, path_vectors = path_model.compute_path_embeddings(
val_model, session, instances)
path_emb_dir = '{dir}/path_embeddings/{dataset}/{corpus}/'.format(
dir=FLAGS.embeddings_base_path,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
if not os.path.exists(path_emb_dir):
os.makedirs(path_emb_dir)
path_model.save_path_embeddings(
val_model, path_vectors, path_index, path_emb_dir)
def epoch_completed(model, session, epoch, epoch_loss,
val_instances, val_labels, saver, save_path, best_f1):
"""Runs every time an epoch completes.
  Print the performance on the validation set, and update the saved model if
  its performance is better than the previous best. If the performance dropped
  significantly, tell the training to stop.
Args:
model: The currently trained path-based model.
session: The current TensorFlow session.
epoch: The epoch number.
epoch_loss: The current epoch loss.
val_instances: The validation set instances (evaluation between epochs).
val_labels: The validation set labels (for evaluation between epochs).
    saver: A tf.train.Saver object.
save_path: Where to save the model.
best_f1: the best F1 achieved so far.
Returns:
    The F1 achieved on the validation set.
"""
# Evaluate on the validation set
val_pred = model.predict(session, val_instances)
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
val_labels, val_pred, average='weighted')
print(
'Epoch: %d/%d, Loss: %f, validation set: P: %.3f, R: %.3f, F1: %.3f\n' % (
epoch + 1, model.hparams.num_epochs, epoch_loss,
precision, recall, f1))
if f1 > best_f1:
print('Saving model in: %s' % (save_path + 'best.ckpt'))
saver.save(session, save_path + 'best.ckpt')
print('Model saved in file: %s' % (save_path + 'best.ckpt'))
return f1
if __name__ == '__main__':
tf.app.run(main)
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Common stuff used with LexNET."""
# pylint: disable=bad-whitespace
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import numpy as np
from sklearn import metrics
import tensorflow as tf
# Part of speech tags used in the paths.
POSTAGS = [
'PAD', 'VERB', 'CONJ', 'NOUN', 'PUNCT',
'ADP', 'ADJ', 'DET', 'ADV', 'PART',
'NUM', 'X', 'INTJ', 'SYM',
]
POSTAG_TO_ID = {tag: tid for tid, tag in enumerate(POSTAGS)}
# Dependency labels used in the paths.
DEPLABELS = [
'PAD', 'UNK', 'ROOT', 'abbrev', 'acomp', 'advcl',
'advmod', 'agent', 'amod', 'appos', 'attr', 'aux',
'auxpass', 'cc', 'ccomp', 'complm', 'conj', 'cop',
'csubj', 'csubjpass', 'dep', 'det', 'dobj', 'expl',
'infmod', 'iobj', 'mark', 'mwe', 'nc', 'neg',
'nn', 'npadvmod', 'nsubj', 'nsubjpass', 'num', 'number',
'p', 'parataxis', 'partmod', 'pcomp', 'pobj', 'poss',
'preconj', 'predet', 'prep', 'prepc', 'prt', 'ps',
'purpcl', 'quantmod', 'rcmod', 'ref', 'rel', 'suffix',
'title', 'tmod', 'xcomp', 'xsubj',
]
DEPLABEL_TO_ID = {label: lid for lid, label in enumerate(DEPLABELS)}
# Direction codes used in the paths.
DIRS = '_^V<>'
DIR_TO_ID = {dir: did for did, dir in enumerate(DIRS)}
def load_word_embeddings(word_embeddings_dir, word_embeddings_file):
"""Loads pretrained word embeddings from a binary file and returns the matrix.
Args:
word_embeddings_dir: The directory for the word embeddings.
word_embeddings_file: The pretrained word embeddings text file.
Returns:
The word embeddings matrix
"""
embedding_file = os.path.join(word_embeddings_dir, word_embeddings_file)
vocab_file = os.path.join(
word_embeddings_dir, os.path.dirname(word_embeddings_file), 'vocab.txt')
with open(vocab_file) as f_in:
vocab = [line.strip() for line in f_in]
vocab_size = len(vocab)
print('Embedding file "%s" has %d tokens' % (embedding_file, vocab_size))
with open(embedding_file) as f_in:
embeddings = np.load(f_in)
dim = embeddings.shape[1]
# Four initially random vectors for the special tokens: <PAD>, <UNK>, <X>, <Y>
special_embeddings = np.random.normal(0, 0.1, (4, dim))
embeddings = np.vstack((special_embeddings, embeddings))
embeddings = embeddings.astype(np.float32)
return embeddings
def full_evaluation(model, session, instances, labels, set_name, classes):
"""Prints a full evaluation on the current set.
  Performance (recall, precision and F1), classification report (per-class
  performance), and confusion matrix.
Args:
model: The currently trained path-based model.
session: The current TensorFlow session.
instances: The current set instances.
labels: The current set labels.
set_name: The current set name (train/validation/test).
classes: The class label names.
Returns:
The model's prediction for the given instances.
"""
# Predict the labels
pred = model.predict(session, instances)
# Print the performance
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
labels, pred, average='weighted')
print('%s set: Precision: %.3f, Recall: %.3f, F1: %.3f' % (
set_name, precision, recall, f1))
# Print a classification report
print('%s classification report:' % set_name)
print(metrics.classification_report(labels, pred, target_names=classes))
# Print the confusion matrix
print('%s confusion matrix:' % set_name)
cm = metrics.confusion_matrix(labels, pred, labels=range(len(classes)))
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
print_cm(cm, labels=classes)
return pred
def print_cm(cm, labels):
"""Pretty print for confusion matrices.
From: https://gist.github.com/zachguo/10296432.
Args:
cm: The confusion matrix.
labels: The class names.
"""
columnwidth = 10
empty_cell = ' ' * columnwidth
short_labels = [label[:12].rjust(10, ' ') for label in labels]
# Print header
header = empty_cell + ' '
header += ''.join([' %{0}s '.format(columnwidth) % label
for label in short_labels])
print(header)
# Print rows
for i, label1 in enumerate(short_labels):
row = '%{0}s '.format(columnwidth) % label1[:10]
for j in range(len(short_labels)):
value = int(cm[i, j]) if not np.isnan(cm[i, j]) else 0
cell = ' %{0}d '.format(10) % value
row += cell + ' '
print(row)
def load_all_labels(records):
"""Reads TensorFlow examples from a RecordReader and returns only the labels.
Args:
records: a record list with TensorFlow examples.
Returns:
The labels
"""
curr_features = tf.parse_example(records, {
'rel_id': tf.FixedLenFeature([1], dtype=tf.int64),
})
labels = tf.squeeze(curr_features['rel_id'], [-1])
return labels
def load_all_pairs(records):
"""Reads TensorFlow examples from a RecordReader and returns the word pairs.
Args:
records: a record list with TensorFlow examples.
Returns:
The word pairs
"""
curr_features = tf.parse_example(records, {
'pair': tf.FixedLenFeature([1], dtype=tf.string)
})
word_pairs = curr_features['pair']
return word_pairs
def write_predictions(pairs, labels, predictions, classes, predictions_file):
"""Write the predictions to a file.
Args:
pairs: the word pairs (list of tuple of two strings).
labels: the gold-standard labels for these pairs (array of rel ID).
predictions: the predicted labels for these pairs (array of rel ID).
classes: a list of relation names.
predictions_file: where to save the predictions.
"""
with open(predictions_file, 'w') as f_out:
for pair, label, pred in zip(pairs, labels, predictions):
w1, w2 = pair
f_out.write('\t'.join([w1, w2, classes[label], classes[pred]]) + '\n')
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""The integrated LexNET model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import lexnet_common
import numpy as np
import tensorflow as tf
class LexNETModel(object):
"""The LexNET model for classifying relationships between noun compounds."""
@classmethod
def default_hparams(cls):
"""Returns the default hyper-parameters."""
return tf.contrib.training.HParams(
batch_size=10,
num_classes=37,
num_epochs=30,
input_keep_prob=0.9,
input='integrated', # dist/ dist-nc/ path/ integrated/ integrated-nc
learn_relata=False,
corpus='wiki_gigawords',
random_seed=133, # zero means no random seed
relata_embeddings_file='glove/glove.6B.300d.bin',
nc_embeddings_file='nc_glove/vecs.6B.300d.bin',
path_embeddings_file='path_embeddings/tratz/fine_grained/wiki',
hidden_layers=1,
path_dim=60)
def __init__(self, hparams, relata_embeddings, path_embeddings, nc_embeddings,
path_to_index):
"""Initialize the LexNET classifier.
Args:
hparams: the hyper-parameters.
relata_embeddings: word embeddings for the distributional component.
path_embeddings: embeddings for the paths.
nc_embeddings: noun compound embeddings.
path_to_index: a mapping from string path to an index in the path
embeddings matrix.
"""
self.hparams = hparams
self.path_embeddings = path_embeddings
self.relata_embeddings = relata_embeddings
self.nc_embeddings = nc_embeddings
self.vocab_size, self.relata_dim = 0, 0
self.path_to_index = None
self.path_dim = 0
# Set the random seed
if hparams.random_seed > 0:
tf.set_random_seed(hparams.random_seed)
# Get the vocabulary size and relata dim
if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
self.vocab_size, self.relata_dim = self.relata_embeddings.shape
# Create the mapping from string path to an index in the embeddings matrix
if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
self.path_to_index = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
tf.constant(path_to_index.keys()),
tf.constant(path_to_index.values()),
key_dtype=tf.string, value_dtype=tf.int32), 0)
self.path_dim = self.path_embeddings.shape[1]
# Create the network
self.__create_computation_graph__()
def __create_computation_graph__(self):
"""Initialize the model and define the graph."""
network_input = 0
# Define the network inputs
# Distributional x and y
if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
network_input += 2 * self.relata_dim
self.relata_lookup = tf.get_variable(
'relata_lookup',
initializer=self.relata_embeddings,
dtype=tf.float32,
trainable=self.hparams.learn_relata)
# Path-based
if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
network_input += self.path_dim
self.path_initial_value_t = tf.placeholder(tf.float32, None)
self.path_lookup = tf.get_variable(
name='path_lookup',
dtype=tf.float32,
trainable=False,
shape=self.path_embeddings.shape)
self.initialize_path_op = tf.assign(
self.path_lookup, self.path_initial_value_t, validate_shape=False)
# Distributional noun compound
if self.hparams.input in ['dist-nc', 'integrated-nc']:
network_input += self.relata_dim
self.nc_initial_value_t = tf.placeholder(tf.float32, None)
self.nc_lookup = tf.get_variable(
name='nc_lookup',
dtype=tf.float32,
trainable=False,
shape=self.nc_embeddings.shape)
self.initialize_nc_op = tf.assign(
self.nc_lookup, self.nc_initial_value_t, validate_shape=False)
hidden_dim = network_input // 2
# Define the MLP
if self.hparams.hidden_layers == 0:
self.weights1 = tf.get_variable(
'W1',
shape=[network_input, self.hparams.num_classes],
dtype=tf.float32)
self.bias1 = tf.get_variable(
'b1',
shape=[self.hparams.num_classes],
dtype=tf.float32)
elif self.hparams.hidden_layers == 1:
self.weights1 = tf.get_variable(
'W1',
shape=[network_input, hidden_dim],
dtype=tf.float32)
self.bias1 = tf.get_variable(
'b1',
shape=[hidden_dim],
dtype=tf.float32)
self.weights2 = tf.get_variable(
'W2',
shape=[hidden_dim, self.hparams.num_classes],
dtype=tf.float32)
self.bias2 = tf.get_variable(
'b2',
shape=[self.hparams.num_classes],
dtype=tf.float32)
else:
raise ValueError('Only 0 or 1 hidden layers are supported')
# Define the variables
self.instances = tf.placeholder(dtype=tf.string,
shape=[self.hparams.batch_size])
(self.x_embedding_id,
self.y_embedding_id,
self.nc_embedding_id,
self.path_embedding_id,
self.path_counts,
self.labels) = parse_tensorflow_examples(
self.instances, self.hparams.batch_size, self.path_to_index)
# Create the MLP
self.__mlp__()
self.instances_to_load = tf.placeholder(dtype=tf.string, shape=[None])
self.labels_to_load = lexnet_common.load_all_labels(self.instances_to_load)
self.pairs_to_load = lexnet_common.load_all_pairs(self.instances_to_load)
def load_labels(self, session, instances):
"""Loads the labels for these instances.
Args:
session: The current TensorFlow session,
instances: The instances for which to load the labels.
Returns:
the labels of these instances.
"""
return session.run(self.labels_to_load,
feed_dict={self.instances_to_load: instances})
def load_pairs(self, session, instances):
"""Loads the word pairs for these instances.
Args:
session: The current TensorFlow session,
instances: The instances for which to load the labels.
Returns:
the word pairs of these instances.
"""
word_pairs = session.run(self.pairs_to_load,
feed_dict={self.instances_to_load: instances})
return [pair[0].split('::') for pair in word_pairs]
def __train_single_batch__(self, session, batch_instances):
"""Train a single batch.
Args:
session: The current TensorFlow session.
      batch_instances: TensorFlow examples containing the training instances.
Returns:
The cost for the current batch.
"""
cost, _ = session.run([self.cost, self.train_op],
feed_dict={self.instances: batch_instances})
return cost
def fit(self, session, inputs, on_epoch_completed, val_instances, val_labels,
save_path):
"""Train the model.
Args:
session: The current TensorFlow session.
      inputs: the training instances (serialized TensorFlow examples).
on_epoch_completed: A method to call after each epoch.
val_instances: The validation set instances (evaluation between epochs).
val_labels: The validation set labels (for evaluation between epochs).
save_path: Where to save the model.
"""
for epoch in range(self.hparams.num_epochs):
losses = []
epoch_indices = list(np.random.permutation(len(inputs)))
# If the number of instances doesn't divide by batch_size, enlarge it
# by duplicating training examples
mod = len(epoch_indices) % self.hparams.batch_size
if mod > 0:
epoch_indices.extend([np.random.randint(0, high=len(inputs))] * mod)
# Define the batches
n_batches = len(epoch_indices) // self.hparams.batch_size
for minibatch in range(n_batches):
batch_indices = epoch_indices[minibatch * self.hparams.batch_size:(
minibatch + 1) * self.hparams.batch_size]
batch_instances = [inputs[i] for i in batch_indices]
loss = self.__train_single_batch__(session, batch_instances)
losses.append(loss)
epoch_loss = np.nanmean(losses)
if on_epoch_completed:
should_stop = on_epoch_completed(self, session, epoch, epoch_loss,
val_instances, val_labels, save_path)
if should_stop:
print('Stopping training after %d epochs.' % epoch)
return
def predict(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the train paths, x, y and/or nc vectors
Returns:
The test predictions.
"""
predictions, _ = zip(*self.predict_with_score(session, inputs))
return np.array(predictions)
def predict_with_score(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the test paths, x, y and/or nc vectors
Returns:
The test predictions along with their scores.
"""
test_pred = [0] * len(inputs)
for chunk in xrange(0, len(test_pred), self.hparams.batch_size):
# Initialize the variables with the current batch data
batch_indices = list(
range(chunk, min(chunk + self.hparams.batch_size, len(test_pred))))
# If the batch is too small, add a few other examples
if len(batch_indices) < self.hparams.batch_size:
batch_indices += [0] * (self.hparams.batch_size-len(batch_indices))
batch_instances = [inputs[i] for i in batch_indices]
predictions, scores = session.run(
[self.predictions, self.scores],
feed_dict={self.instances: batch_instances})
for index_in_batch, index_in_dataset in enumerate(batch_indices):
prediction = predictions[index_in_batch]
score = scores[index_in_batch][prediction]
test_pred[index_in_dataset] = (prediction, score)
return test_pred
def __mlp__(self):
"""Performs the MLP operations.
Returns: the prediction object to be computed in a Session
"""
# Define the operations
# Network input
vec_inputs = []
# Distributional component
if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
for emb_id in [self.x_embedding_id, self.y_embedding_id]:
vec_inputs.append(tf.nn.embedding_lookup(self.relata_lookup, emb_id))
# Noun compound component
if self.hparams.input in ['dist-nc', 'integrated-nc']:
vec = tf.nn.embedding_lookup(self.nc_lookup, self.nc_embedding_id)
vec_inputs.append(vec)
# Path-based component
if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
# Get the current paths for each batch instance
self.path_embeddings = tf.nn.embedding_lookup(self.path_lookup,
self.path_embedding_id)
# self.path_embeddings is of shape
# [batch_size, max_path_per_instance, output_dim]
# We need to multiply it by path counts
# ([batch_size, max_path_per_instance]).
# Start by duplicating path_counts along the output_dim axis.
self.path_freq = tf.tile(tf.expand_dims(self.path_counts, -1),
[1, 1, self.path_dim])
# Compute the averaged path vector for each instance.
# First, multiply the path embeddings and frequencies element-wise.
self.weighted = tf.multiply(self.path_freq, self.path_embeddings)
# Second, take the sum to get a tensor of shape [batch_size, output_dim].
self.pair_path_embeddings = tf.reduce_sum(self.weighted, 1)
# Finally, divide by the total number of paths.
# The number of paths for each pair has a shape [batch_size, 1],
# We duplicate it output_dim times along the second axis.
self.num_paths = tf.clip_by_value(
tf.reduce_sum(self.path_counts, 1), 1, np.inf)
self.num_paths = tf.tile(tf.expand_dims(self.num_paths, -1),
[1, self.path_dim])
# And finally, divide pair_path_embeddings by num_paths element-wise.
self.pair_path_embeddings = tf.div(
self.pair_path_embeddings, self.num_paths)
vec_inputs.append(self.pair_path_embeddings)
# Concatenate the inputs and feed to the MLP
self.input_vec = tf.nn.dropout(
tf.concat(vec_inputs, 1),
keep_prob=self.hparams.input_keep_prob)
h = tf.matmul(self.input_vec, self.weights1)
self.output = h
if self.hparams.hidden_layers == 1:
self.output = tf.matmul(tf.nn.tanh(h), self.weights2)
self.scores = self.output
self.predictions = tf.argmax(self.scores, axis=1)
# Define the loss function and the optimization algorithm
self.cross_entropies = tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=self.scores, labels=self.labels)
self.cost = tf.reduce_sum(self.cross_entropies, name='cost')
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.optimizer = tf.train.AdamOptimizer()
self.train_op = self.optimizer.minimize(
self.cost, global_step=self.global_step)
def parse_tensorflow_examples(record, batch_size, path_to_index):
"""Reads TensorFlow examples from a RecordReader.
Args:
record: a record with TensorFlow examples.
batch_size: the number of instances in a minibatch
path_to_index: mapping from string path to index in the embeddings matrix.
Returns:
The word embeddings IDs, paths and counts
"""
features = tf.parse_example(
record, {
'x_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
'y_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
'nc_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
'reprs': tf.FixedLenSequenceFeature(
shape=(), dtype=tf.string, allow_missing=True),
'counts': tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'rel_id': tf.FixedLenFeature([1], dtype=tf.int64)
})
x_embedding_id = tf.squeeze(features['x_embedding_id'], [-1])
y_embedding_id = tf.squeeze(features['y_embedding_id'], [-1])
nc_embedding_id = tf.squeeze(features['nc_embedding_id'], [-1])
labels = tf.squeeze(features['rel_id'], [-1])
path_counts = tf.to_float(tf.reshape(features['counts'], [batch_size, -1]))
path_embedding_id = None
if path_to_index:
path_embedding_id = path_to_index.lookup(features['reprs'])
return (
x_embedding_id, y_embedding_id, nc_embedding_id,
path_embedding_id, path_counts, labels)
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""LexNET Path-based Model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import itertools
import os
import lexnet_common
import numpy as np
import tensorflow as tf
class PathBasedModel(object):
"""The LexNET path-based model for classifying semantic relations."""
@classmethod
def default_hparams(cls):
"""Returns the default hyper-parameters."""
return tf.contrib.training.HParams(
max_path_len=8,
num_classes=37,
num_epochs=30,
input_keep_prob=0.9,
learning_rate=0.001,
learn_lemmas=False,
random_seed=133, # zero means no random seed
lemma_embeddings_file='glove/glove.6B.50d.bin',
num_pos=len(lexnet_common.POSTAGS),
num_dep=len(lexnet_common.DEPLABELS),
num_directions=len(lexnet_common.DIRS),
lemma_dim=50,
pos_dim=4,
dep_dim=5,
dir_dim=1)
def __init__(self, hparams, lemma_embeddings, instance):
"""Initialize the LexNET classifier.
Args:
hparams: the hyper-parameters.
lemma_embeddings: word embeddings for the path-based component.
instance: string tensor containing the input instance
"""
self.hparams = hparams
self.lemma_embeddings = lemma_embeddings
self.instance = instance
self.vocab_size, self.lemma_dim = self.lemma_embeddings.shape
# Set the random seed
if hparams.random_seed > 0:
tf.set_random_seed(hparams.random_seed)
# Create the network
self.__create_computation_graph__()
def __create_computation_graph__(self):
"""Initialize the model and define the graph."""
self.lstm_input_dim = sum([self.hparams.lemma_dim, self.hparams.pos_dim,
self.hparams.dep_dim, self.hparams.dir_dim])
self.lstm_output_dim = self.lstm_input_dim
network_input = self.lstm_output_dim
self.lemma_lookup = tf.get_variable(
'lemma_lookup',
initializer=self.lemma_embeddings,
dtype=tf.float32,
trainable=self.hparams.learn_lemmas)
self.pos_lookup = tf.get_variable(
'pos_lookup',
shape=[self.hparams.num_pos, self.hparams.pos_dim],
dtype=tf.float32)
self.dep_lookup = tf.get_variable(
'dep_lookup',
shape=[self.hparams.num_dep, self.hparams.dep_dim],
dtype=tf.float32)
self.dir_lookup = tf.get_variable(
'dir_lookup',
shape=[self.hparams.num_directions, self.hparams.dir_dim],
dtype=tf.float32)
self.weights1 = tf.get_variable(
'W1',
shape=[network_input, self.hparams.num_classes],
dtype=tf.float32)
self.bias1 = tf.get_variable(
'b1',
shape=[self.hparams.num_classes],
dtype=tf.float32)
# Define the variables
(self.batch_paths,
self.path_counts,
self.seq_lengths,
self.path_strings,
self.batch_labels) = _parse_tensorflow_example(
self.instance, self.hparams.max_path_len, self.hparams.input_keep_prob)
# Create the LSTM
self.__lstm__()
# Create the MLP
self.__mlp__()
self.instances_to_load = tf.placeholder(dtype=tf.string, shape=[None])
self.labels_to_load = lexnet_common.load_all_labels(self.instances_to_load)
def load_labels(self, session, batch_instances):
"""Loads the labels of the current instances.
Args:
session: the current TensorFlow session.
batch_instances: the dataset instances.
Returns:
the labels.
"""
return session.run(self.labels_to_load,
feed_dict={self.instances_to_load: batch_instances})
def run_one_epoch(self, session, num_steps):
"""Train the model.
Args:
session: The current TensorFlow session.
num_steps: The number of steps in each epoch.
Returns:
The mean loss for the epoch.
Raises:
ArithmeticError: if the loss becomes non-finite.
"""
losses = []
for step in range(num_steps):
curr_loss, _ = session.run([self.cost, self.train_op])
if not np.isfinite(curr_loss):
raise ArithmeticError('nan loss at step %d' % step)
losses.append(curr_loss)
return np.mean(losses)
def predict(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the train paths, x, y and/or nc vectors
Returns:
The test predictions.
"""
predictions, _ = zip(*self.predict_with_score(session, inputs))
return np.array(predictions)
def predict_with_score(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the test paths, x, y and/or nc vectors
Returns:
The test predictions along with their scores.
"""
test_pred = [0] * len(inputs)
for index, instance in enumerate(inputs):
prediction, scores = session.run(
[self.predictions, self.scores],
feed_dict={self.instance: instance})
test_pred[index] = (prediction, scores[prediction])
return test_pred
def __mlp__(self):
"""Performs the MLP operations.
Returns: the prediction object to be computed in a Session
"""
# Feed the paths to the MLP: path_embeddings is
# [num_batch_paths, output_dim], and when we multiply it by W
# ([output_dim, num_classes]), we get a matrix of class distributions:
# [num_batch_paths, num_classes].
self.distributions = tf.matmul(self.path_embeddings, self.weights1)
# Now, compute weighted average on the class distributions, using the path
# frequency as weights.
# First, reshape path_freq to the same shape of distributions
self.path_freq = tf.tile(tf.expand_dims(self.path_counts, -1),
[1, self.hparams.num_classes])
# Second, multiply the distributions and frequencies element-wise.
self.weighted = tf.multiply(self.path_freq, self.distributions)
# Finally, take the average to get a tensor of shape [1, num_classes].
self.weighted_sum = tf.reduce_sum(self.weighted, 0)
self.num_paths = tf.clip_by_value(tf.reduce_sum(self.path_counts),
1, np.inf)
self.num_paths = tf.tile(tf.expand_dims(self.num_paths, -1),
[self.hparams.num_classes])
self.scores = tf.div(self.weighted_sum, self.num_paths)
self.predictions = tf.argmax(self.scores)
# Define the loss function and the optimization algorithm
self.cross_entropies = tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=self.scores, labels=tf.reduce_mean(self.batch_labels))
self.cost = tf.reduce_sum(self.cross_entropies, name='cost')
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.optimizer = tf.train.AdamOptimizer()
self.train_op = self.optimizer.minimize(self.cost,
global_step=self.global_step)
def __lstm__(self):
"""Defines the LSTM operations.
Returns:
A matrix of path embeddings.
"""
lookup_tables = [self.lemma_lookup, self.pos_lookup,
self.dep_lookup, self.dir_lookup]
# Split the edges to components: list of 4 tensors
# [num_batch_paths, max_path_len, 1]
self.edge_components = tf.split(self.batch_paths, 4, axis=2)
# Look up the components embeddings and concatenate them back together
self.path_matrix = tf.concat([
tf.squeeze(tf.nn.embedding_lookup(lookup_table, component), 2)
for lookup_table, component in
zip(lookup_tables, self.edge_components)
], axis=2)
self.sequence_lengths = tf.reshape(self.seq_lengths, [-1])
# Define the LSTM.
# The input is [num_batch_paths, max_path_len, input_dim].
lstm_cell = tf.contrib.rnn.BasicLSTMCell(self.lstm_output_dim)
# The output is [num_batch_paths, max_path_len, output_dim].
self.lstm_outputs, _ = tf.nn.dynamic_rnn(
lstm_cell, self.path_matrix, dtype=tf.float32,
sequence_length=self.sequence_lengths)
# Slice the last *relevant* output for each instance ->
# [num_batch_paths, output_dim]
self.path_embeddings = _extract_last_relevant(self.lstm_outputs,
self.sequence_lengths)
def _parse_tensorflow_example(record, max_path_len, input_keep_prob):
"""Reads TensorFlow examples from a RecordReader.
Args:
record: a record with TensorFlow example.
max_path_len: the maximum path length.
input_keep_prob: 1 - the word dropout probability
Returns:
The paths and counts
"""
features = tf.parse_single_example(record, {
'lemmas':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'postags':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'deplabels':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'dirs':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'counts':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'pathlens':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'reprs':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.string, allow_missing=True),
'rel_id':
tf.FixedLenFeature([], dtype=tf.int64)
})
path_counts = tf.to_float(features['counts'])
seq_lengths = features['pathlens']
# Concatenate the edge components to create a path tensor:
# [max_paths_per_ins, max_path_length, 4]
lemmas = _word_dropout(
tf.reshape(features['lemmas'], [-1, max_path_len]), input_keep_prob)
paths = tf.stack(
[lemmas] + [
tf.reshape(features[f], [-1, max_path_len])
for f in ('postags', 'deplabels', 'dirs')
],
axis=-1)
path_strings = features['reprs']
# Add an empty path to pairs with no paths
paths = tf.cond(
tf.shape(paths)[0] > 0,
lambda: paths,
lambda: tf.zeros([1, max_path_len, 4], dtype=tf.int64))
# Paths are left-padded. We reverse them to make them right-padded.
paths = tf.reverse(paths, axis=[1])
path_counts = tf.cond(
tf.shape(path_counts)[0] > 0,
lambda: path_counts,
lambda: tf.constant([1.0], dtype=tf.float32))
seq_lengths = tf.cond(
tf.shape(seq_lengths)[0] > 0,
lambda: seq_lengths,
lambda: tf.constant([1], dtype=tf.int64))
# Duplicate the label for each path
labels = tf.ones_like(path_counts, dtype=tf.int64) * features['rel_id']
return paths, path_counts, seq_lengths, path_strings, labels
def _extract_last_relevant(output, seq_lengths):
"""Get the last relevant LSTM output cell for each batch instance.
Args:
    output: the LSTM outputs - a tensor with shape
      [num_paths, max_path_len, output_dim].
seq_lengths: the sequences length per instance
Returns:
The last relevant LSTM output cell for each batch instance.
"""
max_length = int(output.get_shape()[1])
path_lengths = tf.clip_by_value(seq_lengths - 1, 0, max_length)
relevant = tf.reduce_sum(tf.multiply(output, tf.expand_dims(
tf.one_hot(path_lengths, max_length), -1)), 1)
return relevant
def _word_dropout(words, input_keep_prob):
"""Drops words with probability 1 - input_keep_prob.
Args:
words: a list of lemmas from the paths.
input_keep_prob: the probability to keep the word.
Returns:
The revised list where some of the words are <UNK>ed.
"""
# Create the mask: (-1) to drop, 1 to keep
prob = tf.random_uniform(tf.shape(words), 0, 1)
condition = tf.less(prob, (1 - input_keep_prob))
mask = tf.where(condition,
tf.negative(tf.ones_like(words)), tf.ones_like(words))
# We need to keep zeros (<PAD>), and change other numbers to 1 (<UNK>)
# if their mask is -1. First, we multiply the mask and the words.
# Zeros will stay zeros, and words to drop will become negative.
# Then, we change negative values to 1.
masked_words = tf.multiply(mask, words)
condition = tf.less(masked_words, 0)
dropped_words = tf.where(condition, tf.ones_like(words), words)
return dropped_words
def compute_path_embeddings(model, session, instances):
"""Compute the path embeddings for all the distinct paths.
Args:
model: The trained path-based model.
session: The current TensorFlow session.
instances: All the train, test and validation instances.
Returns:
The path to ID index and the path embeddings.
"""
# Get an index for each distinct path
path_index = collections.defaultdict(itertools.count(0).next)
path_vectors = {}
for instance in instances:
curr_path_embeddings, curr_path_strings = session.run(
[model.path_embeddings, model.path_strings],
feed_dict={model.instance: instance})
for i, path in enumerate(curr_path_strings):
if not path:
continue
# Set a new/existing index for the path
index = path_index[path]
# Save its vector
path_vectors[index] = curr_path_embeddings[i, :]
print('Number of distinct paths: %d' % len(path_index))
return path_index, path_vectors
def save_path_embeddings(model, path_vectors, path_index, embeddings_base_path):
"""Saves the path embeddings.
Args:
model: The trained path-based model.
path_vectors: The path embeddings.
path_index: A map from path to ID.
embeddings_base_path: The base directory where the embeddings are.
"""
index_range = range(max(path_index.values()) + 1)
path_matrix = [path_vectors[i] for i in index_range]
path_matrix = np.vstack(path_matrix)
# Save the path embeddings
path_vector_filename = os.path.join(
embeddings_base_path, '%d_path_vectors' % model.lstm_output_dim)
with open(path_vector_filename, 'w') as f_out:
np.save(f_out, path_matrix)
index_to_path = {i: p for p, i in path_index.iteritems()}
path_vocab = [index_to_path[i] for i in index_range]
# Save the path vocabulary
path_vocab_filename = os.path.join(
embeddings_base_path, '%d_path_vocab' % model.lstm_output_dim)
with open(path_vocab_filename, 'w') as f_out:
f_out.write('\n'.join(path_vocab))
f_out.write('\n')
print('Saved path embeddings.')
def load_path_embeddings(path_embeddings_dir, path_dim):
"""Loads pretrained path embeddings from a binary file and returns the matrix.
Args:
path_embeddings_dir: The directory for the path embeddings.
path_dim: The dimension of the path embeddings, used as prefix to the
path_vocab and path_vectors files.
Returns:
The path embeddings matrix and the path_to_index dictionary.
"""
prefix = path_embeddings_dir + '/%d' % path_dim + '_'
with open(prefix + 'path_vocab') as f_in:
vocab = f_in.read().splitlines()
vocab_size = len(vocab)
embedding_file = prefix + 'path_vectors'
print('Embedding file "%s" has %d paths' % (embedding_file, vocab_size))
with open(embedding_file) as f_in:
embeddings = np.load(f_in)
path_to_index = {p: i for i, p in enumerate(vocab)}
return embeddings, path_to_index
def get_indicative_paths(model, session, path_index, path_vectors, classes,
save_dir, k=20, threshold=0.8):
"""Gets the most indicative paths for each class.
Args:
model: The trained path-based model.
session: The current TensorFlow session.
path_index: A map from path to ID.
path_vectors: The path embeddings.
classes: The class label names.
save_dir: Where to save the paths.
k: The k for top-k paths.
threshold: The threshold above which to consider paths as indicative.
"""
# Define graph variables for this operation
p_path_embedding = tf.placeholder(dtype=tf.float32,
shape=[1, model.lstm_output_dim])
p_distributions = tf.nn.softmax(tf.matmul(p_path_embedding, model.weights1))
# Treat each path as a pair instance with a single path, and get the
# relation distribution for it. Then, take the top paths for each relation.
# This dictionary contains a relation as a key, and the value is a list of
# tuples of path index and score. A relation r will contain (p, s) if the
# path p is classified to r with a confidence of s.
prediction_per_relation = collections.defaultdict(list)
index_to_path = {i: p for p, i in path_index.iteritems()}
# Predict all the paths
for index in range(len(path_index)):
curr_path_vector = path_vectors[index]
distribution = session.run(p_distributions,
feed_dict={
p_path_embedding: np.reshape(
curr_path_vector,
[1, model.lstm_output_dim])})
distribution = distribution[0, :]
prediction = np.argmax(distribution)
prediction_per_relation[prediction].append(
(index, distribution[prediction]))
if index % 10000 == 0:
print('Classified %d/%d (%3.2f%%) of the paths' % (
index, len(path_index), 100 * index / len(path_index)))
# Retrieve k-best scoring paths for each relation
for relation_index, relation in enumerate(classes):
curr_paths = sorted(prediction_per_relation[relation_index],
key=lambda item: item[1], reverse=True)
above_t = [(p, s) for (p, s) in curr_paths if s >= threshold]
    top_k = curr_paths[:k]
relation_paths = above_t if len(above_t) > len(top_k) else top_k
paths_filename = os.path.join(save_dir, '%s.paths' % relation)
with open(paths_filename, 'w') as f_out:
for index, score in relation_paths:
print('\t'.join([index_to_path[index], str(score)]), file=f_out)
@@ -45,8 +45,6 @@ import sys
import tensorflow as tf
from six.moves import xrange
flags = tf.app.flags
flags.DEFINE_string('input', 'coocurrences.bin', 'Vocabulary file')
@@ -133,7 +131,6 @@ def make_shard_files(coocs, nshards, vocab_sz):
  return (shard_files, row_sums)
def main(_):
  with open(FLAGS.vocab, 'r') as lines:
    orig_vocab_sz = sum(1 for _ in lines)
@@ -196,6 +193,5 @@ def main(_):
  print('done!')
if __name__ == '__main__':
  tf.app.run()