Unverified commit deb772d5 authored by Lukasz Kaiser, committed by GitHub

Merge pull request #3683 from waterson/master

Add LexNet noun compounds model to models repository.
parents 2b775720 1f371534
@@ -17,6 +17,7 @@
/research/inception/ @shlens @vincentvanhoucke
/research/learned_optimizer/ @olganw @nirum
/research/learning_to_remember_rare_events/ @lukaszkaiser @ofirnachum
/research/lexnet_nc/ @vered1986 @waterson
/research/lfads/ @jazcollins @susillo
/research/lm_1b/ @oriolvinyals @panyx0718
/research/maskgan/ @a-dai
...
@@ -36,6 +36,8 @@ installation](https://www.tensorflow.org/install).
- [inception](inception): deep convolutional networks for computer vision.
- [learning_to_remember_rare_events](learning_to_remember_rare_events): a
large-scale life-long memory module for use in deep learning.
- [lexnet_nc](lexnet_nc): a distributed model for noun compound relationship
classification.
- [lfads](lfads): sequential variational autoencoder for analyzing
neuroscience data.
- [lm_1b](lm_1b): language modeling on the one billion word benchmark.
...
# LexNET for Noun Compound Relation Classification
This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for relation classification, applied here to the relationships that
hold between the constituents of noun compounds:
* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor
The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:
1. A neural "paraphrase" of each syntactic dependency path that connects the
constituents in a large corpus. For example, given a sentence like *This fine
oil is made from first-press olives*, the dependency path is something like
`oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words; i.e., the
word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
embedding of the noun compound in context.
The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2). The *distributional-nc model* and the *integrated-nc* model each add (3).
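
To make the variants concrete, here is a minimal NumPy sketch (not the model's
actual API; the dimensions are the defaults from `lexnet_model.py`, and all
values are random placeholders) of how the *integrated-nc* variant concatenates
its inputs and feeds them through a one-hidden-layer MLP:

    import numpy as np

    relata_dim, path_dim, num_classes = 300, 60, 37  # default hyper-parameters

    x_emb = np.random.rand(relata_dim)       # (2) embedding of the first constituent
    y_emb = np.random.rand(relata_dim)       # (2) embedding of the second constituent
    avg_path_emb = np.random.rand(path_dim)  # (1) count-weighted average path vector
    nc_emb = np.random.rand(relata_dim)      # (3) embedding of the compound itself

    # The network input is the concatenation of whichever components the
    # chosen variant uses.
    network_input = np.concatenate([x_emb, y_emb, avg_path_emb, nc_emb])

    # One hidden layer (the default), then a linear layer over the relations.
    hidden_dim = network_input.size // 2
    W1 = np.random.rand(network_input.size, hidden_dim)
    W2 = np.random.rand(hidden_dim, num_classes)
    scores = np.dot(np.tanh(np.dot(network_input, W1)), W2)
    prediction = int(np.argmax(scores))
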
Training a model requires the following:
1. A collection of noun compounds that have been labeled using a *relation
inventory*. The inventory describes the specific relationships that you'd
like the model to differentiate (e.g. *part of* versus *composed of* versus
*purpose*), and generally may consist of tens of classes; see the sketch below.
2. A collection of word embeddings: the path-based model uses the
word embeddings as part of the path representation, and the distributional
models use the word embeddings directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
that connect the constituents for each noun compound.
At the moment, this repository does not contain the tools for generating this
data, but we will provide references to existing datasets and plan to add tools
to generate the data in the future.
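
For reference, the relation inventory from item (1) is expected as a plain-text
`classes.txt` file with one relation name per line under the dataset directory;
the following sketch mirrors how `learn_path_embeddings.py` and
`learn_classifier.py` load it (directory names are just the script defaults):

    import os

    dataset_dir, dataset = 'datasets', 'tratz/fine_grained'  # default flag values
    classes_filename = os.path.join(dataset_dir, dataset, 'classes.txt')
    with open(classes_filename) as f_in:
      classes = f_in.read().splitlines()  # one relation name per line
    num_classes = len(classes)            # becomes hparams.num_classes
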
# Contents
The following source code is included here:
* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
model to predict a noun-compound relationship given labeled noun-compounds and
dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
syntactic dependency paths for a particular relationship (see the sketch below).
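
As a rough illustration of the scoring idea behind `get_indicative_paths.py`:
each path embedding is classified on its own with the trained path weights, and
the highest-confidence paths per relation are kept. The NumPy sketch below is a
simplification (the script runs this inside a TensorFlow session and also falls
back to the top-k paths when few exceed the threshold):

    import numpy as np

    num_paths, path_dim, num_classes = 1000, 60, 37      # illustrative sizes
    path_vectors = np.random.rand(num_paths, path_dim)   # learned path embeddings
    W1 = np.random.rand(path_dim, num_classes)           # trained classifier weights

    logits = np.dot(path_vectors, W1)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    predictions = probs.argmax(axis=1)  # most likely relation for each path
    confidences = probs.max(axis=1)

    threshold = 0.8
    for relation in range(num_classes):
      indicative = np.where((predictions == relation) & (confidences >= threshold))[0]
      # ...the script then writes the corresponding path strings to <relation>.paths.
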
# Dependencies
* [TensorFlow](http://www.tensorflow.org/): see detailed installation
instructions at that site.
* [SciKit Learn](http://scikit-learn.org/): you can probably just install this
with `pip install sklearn`.
# Creating the Model
This section describes the necessary steps that you must follow to reproduce the
results in the paper.
## Generate/Download Path Data
TBD! Our plan is to make available the aggregate path data that was used to
train the path embeddings and classifiers; however, this will be released
separately.
## Generate/Download Embedding Data
TBD! While we used the standard GloVe vectors for the relata embeddings, the NC
embeddings were generated separately. Our plan is to make that data available,
but it will be released separately.
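
Until then, note that `lexnet_common.load_word_embeddings()` expects a
NumPy-saved embedding matrix with a `vocab.txt` file in the same directory, and
it prepends four random vectors for the special tokens `<PAD>`, `<UNK>`, `<X>`
and `<Y>`. A minimal sketch of that layout, using the default file names from
the hyper-parameters:

    import os
    import numpy as np

    embeddings_base_path = 'embeddings'                 # default flag value
    relata_embeddings_file = 'glove/glove.6B.300d.bin'  # default hparam (an np.save'd matrix)

    embedding_file = os.path.join(embeddings_base_path, relata_embeddings_file)
    vocab_file = os.path.join(os.path.dirname(embedding_file), 'vocab.txt')

    with open(vocab_file) as f_in:
      vocab = [line.strip() for line in f_in]  # row i + 4 of the final matrix is vocab[i]
    embeddings = np.load(embedding_file)

    # Four random rows are prepended for <PAD>, <UNK>, <X>, <Y>.
    special = np.random.normal(0, 0.1, (4, embeddings.shape[1]))
    embeddings = np.vstack((special, embeddings)).astype(np.float32)
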
## Create Path Embeddings
Create the path embeddings using `learn_path_embeddings.py`. This shell script
fragment will iterate through each dataset, split, and corpus to generate path
embeddings for each.

    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigawords ; do
          python learn_path_embeddings.py \
            --dataset_dir ~/lexnet/datasets \
            --dataset "${DATASET}" \
            --corpus "${SPLIT}/${CORPUS}" \
            --embeddings_base_path ~/lexnet/embeddings \
            --logdir /tmp/learn_path_embeddings
        done
      done
    done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.
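
They can be read back with `path_model.load_path_embeddings()`, the same way
`get_indicative_paths.py` does; a small sketch using the default dimensions
(lemma 50 + POS 4 + dependency 5 + direction 1 = 60):

    import os
    import path_model

    embeddings_base_path = os.path.expanduser('~/lexnet/embeddings')
    dataset, corpus = 'tratz/fine_grained', 'random/wiki_gigawords'
    path_dim = 50 + 4 + 5 + 1  # lemma_dim + pos_dim + dep_dim + dir_dim

    path_embeddings, path_to_index = path_model.load_path_embeddings(
        os.path.join(embeddings_base_path, 'path_embeddings', dataset, corpus),
        path_dim)
    print('Loaded %d path embeddings' % len(path_to_index))
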
## Train Classifiers
Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.

    LOGDIR=/tmp/learn_classifier
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.
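
For reference, the reported precision, recall, and F1 are the weighted averages
computed with scikit-learn in `lexnet_common.full_evaluation()`; a toy example
(the label values here are made up):

    from sklearn import metrics

    gold = [0, 1, 2, 1, 0]  # gold-standard relation IDs
    pred = [0, 1, 1, 1, 0]  # predicted relation IDs
    precision, recall, f1, _ = metrics.precision_recall_fscore_support(
        gold, pred, average='weighted')
    print('P: %.3f, R: %.3f, F1: %.3f' % (precision, recall, f1))
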
# Contact
If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Extracts paths that are indicative of each relation."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import tensorflow as tf
from . import path_model
from . import lexnet_common
tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')
tf.flags.DEFINE_string(
'dataset',
'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')
tf.flags.DEFINE_string(
'corpus', 'random/wiki',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')
tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')
tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')
tf.flags.DEFINE_integer(
'top_k', 20, 'Number of top paths to extract')
tf.flags.DEFINE_float(
'threshold', 0.8, 'Threshold above which to consider paths as indicative')
FLAGS = tf.flags.FLAGS
def main(_):
hparams = path_model.PathBasedModel.default_hparams()
# First things first. Load the path data.
path_embeddings_file = 'path_embeddings/{dataset}/{corpus}'.format(
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
path_dim = (hparams.lemma_dim + hparams.pos_dim +
hparams.dep_dim + hparams.dir_dim)
path_embeddings, path_to_index = path_model.load_path_embeddings(
os.path.join(FLAGS.embeddings_base_path, path_embeddings_file),
path_dim)
# Load and count the classes so we can correctly instantiate the model.
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
with open(classes_filename) as f_in:
classes = f_in.read().splitlines()
hparams.num_classes = len(classes)
# We need the word embeddings to instantiate the model, too.
print('Loading word embeddings...')
lemma_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)
# Instantiate the model.
with tf.Graph().as_default():
with tf.variable_scope('lexnet'):
instance = tf.placeholder(dtype=tf.string)
model = path_model.PathBasedModel(
hparams, lemma_embeddings, instance)
with tf.Session() as session:
model_dir = '{logdir}/results/{dataset}/path/{corpus}'.format(
logdir=FLAGS.logdir,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
saver = tf.train.Saver()
saver.restore(session, os.path.join(model_dir, 'best.ckpt'))
path_model.get_indicative_paths(
model, session, path_to_index, path_embeddings, classes,
model_dir, FLAGS.top_k, FLAGS.threshold)
if __name__ == '__main__':
tf.app.run()
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Trains the integrated LexNET classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import lexnet_common
import lexnet_model
import path_model
from sklearn import metrics
import tensorflow as tf
tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')
tf.flags.DEFINE_string(
'dataset', 'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')
tf.flags.DEFINE_string(
'corpus', 'wiki/random',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')
tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')
tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')
tf.flags.DEFINE_string('hparams', '', 'Hyper-parameters')
tf.flags.DEFINE_string(
'input', 'integrated',
    'The model (dist/dist-nc/path/integrated/integrated-nc)')
FLAGS = tf.flags.FLAGS
def main(_):
# Pick up any one-off hyper-parameters.
hparams = lexnet_model.LexNETModel.default_hparams()
hparams.corpus = FLAGS.corpus
hparams.input = FLAGS.input
hparams.path_embeddings_file = 'path_embeddings/%s/%s' % (
FLAGS.dataset, FLAGS.corpus)
input_dir = hparams.input if hparams.input != 'path' else 'path_classifier'
# Set the number of classes
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
with open(classes_filename) as f_in:
classes = f_in.read().splitlines()
hparams.num_classes = len(classes)
print('Model will predict into %d classes' % hparams.num_classes)
# Get the datasets
train_set, val_set, test_set = (
os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, FLAGS.corpus,
filename + '.tfrecs.gz')
for filename in ['train', 'val', 'test'])
print('Running with hyper-parameters: {}'.format(hparams))
# Load the instances
print('Loading instances...')
opts = tf.python_io.TFRecordOptions(
compression_type=tf.python_io.TFRecordCompressionType.GZIP)
train_instances = list(tf.python_io.tf_record_iterator(train_set, opts))
val_instances = list(tf.python_io.tf_record_iterator(val_set, opts))
test_instances = list(tf.python_io.tf_record_iterator(test_set, opts))
# Load the word embeddings
print('Loading word embeddings...')
relata_embeddings, path_embeddings, nc_embeddings, path_to_index = (
None, None, None, None)
if hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
relata_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.relata_embeddings_file)
if hparams.input in ['path', 'integrated', 'integrated-nc']:
path_embeddings, path_to_index = path_model.load_path_embeddings(
os.path.join(FLAGS.embeddings_base_path, hparams.path_embeddings_file),
hparams.path_dim)
if hparams.input in ['dist-nc', 'integrated-nc']:
nc_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.nc_embeddings_file)
# Define the graph and the model
with tf.Graph().as_default():
model = lexnet_model.LexNETModel(
hparams, relata_embeddings, path_embeddings,
nc_embeddings, path_to_index)
# Initialize a session and start training
session = tf.Session()
session.run(tf.global_variables_initializer())
  # Initialize the path mapping
if hparams.input in ['path', 'integrated', 'integrated-nc']:
session.run(tf.tables_initializer())
session.run(model.initialize_path_op, {
model.path_initial_value_t: path_embeddings
})
# Initialize the NC embeddings
if hparams.input in ['dist-nc', 'integrated-nc']:
session.run(model.initialize_nc_op, {
model.nc_initial_value_t: nc_embeddings
})
# Load the labels
print('Loading labels...')
train_labels = model.load_labels(session, train_instances)
val_labels = model.load_labels(session, val_instances)
test_labels = model.load_labels(session, test_instances)
save_path = '{logdir}/results/{dataset}/{input}/{corpus}'.format(
logdir=FLAGS.logdir, dataset=FLAGS.dataset,
corpus=model.hparams.corpus, input=input_dir)
if not os.path.exists(save_path):
os.makedirs(save_path)
# Train the model
print('Training the model...')
model.fit(session, train_instances, epoch_completed,
val_instances, val_labels, save_path)
# Print the best performance on the validation set
print('Best performance on the validation set: F1=%.3f' %
epoch_completed.best_f1)
# Evaluate on the train and validation sets
lexnet_common.full_evaluation(model, session, train_instances, train_labels,
'Train', classes)
lexnet_common.full_evaluation(model, session, val_instances, val_labels,
'Validation', classes)
test_predictions = lexnet_common.full_evaluation(
model, session, test_instances, test_labels, 'Test', classes)
# Write the test predictions to a file
predictions_file = os.path.join(save_path, 'test_predictions.tsv')
  print('Saving test predictions to %s' % predictions_file)
test_pairs = model.load_pairs(session, test_instances)
lexnet_common.write_predictions(test_pairs, test_labels, test_predictions,
classes, predictions_file)
def epoch_completed(model, session, epoch, epoch_loss,
val_instances, val_labels, save_path):
"""Runs every time an epoch completes.
  Print the performance on the validation set, and update the saved model if
  its performance is better than the previous best. If the performance dropped
  significantly, tell the training to stop.
Args:
model: The currently trained path-based model.
session: The current TensorFlow session.
epoch: The epoch number.
epoch_loss: The current epoch loss.
val_instances: The validation set instances (evaluation between epochs).
val_labels: The validation set labels (for evaluation between epochs).
save_path: Where to save the model.
Returns:
whether the training should stop.
"""
stop_training = False
# Evaluate on the validation set
val_pred = model.predict(session, val_instances)
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
val_labels, val_pred, average='weighted')
print(
'Epoch: %d/%d, Loss: %f, validation set: P: %.3f, R: %.3f, F1: %.3f\n' % (
epoch + 1, model.hparams.num_epochs, epoch_loss,
precision, recall, f1))
# If the F1 is much smaller than the previous one, stop training. Else, if
# it's bigger, save the model.
if f1 < epoch_completed.best_f1 - 0.08:
stop_training = True
if f1 > epoch_completed.best_f1:
saver = tf.train.Saver()
checkpoint_filename = os.path.join(save_path, 'best.ckpt')
print('Saving model in: %s' % checkpoint_filename)
saver.save(session, checkpoint_filename)
print('Model saved in file: %s' % checkpoint_filename)
epoch_completed.best_f1 = f1
return stop_training
epoch_completed.best_f1 = 0
if __name__ == '__main__':
tf.app.run(main)
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Trains the LexNET path-based model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import lexnet_common
import path_model
from sklearn import metrics
import tensorflow as tf
tf.flags.DEFINE_string(
'dataset_dir', 'datasets',
'Dataset base directory')
tf.flags.DEFINE_string(
'dataset',
'tratz/fine_grained',
'Subdirectory containing the corpus directories: '
'subdirectory of dataset_dir')
tf.flags.DEFINE_string(
'corpus', 'random/wiki_gigawords',
'Subdirectory containing the corpus and split: '
'subdirectory of dataset_dir/dataset')
tf.flags.DEFINE_string(
'embeddings_base_path', 'embeddings',
'Embeddings base directory')
tf.flags.DEFINE_string(
'logdir', 'logdir',
'Directory of model output files')
FLAGS = tf.flags.FLAGS
def main(_):
# Pick up any one-off hyper-parameters.
hparams = path_model.PathBasedModel.default_hparams()
# Set the number of classes
classes_filename = os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
with open(classes_filename) as f_in:
classes = f_in.read().splitlines()
hparams.num_classes = len(classes)
print('Model will predict into %d classes' % hparams.num_classes)
# Get the datasets
train_set, val_set, test_set = (
os.path.join(
FLAGS.dataset_dir, FLAGS.dataset, FLAGS.corpus,
filename + '.tfrecs.gz')
for filename in ['train', 'val', 'test'])
print('Running with hyper-parameters: {}'.format(hparams))
# Load the instances
print('Loading instances...')
opts = tf.python_io.TFRecordOptions(
compression_type=tf.python_io.TFRecordCompressionType.GZIP)
train_instances = list(tf.python_io.tf_record_iterator(train_set, opts))
val_instances = list(tf.python_io.tf_record_iterator(val_set, opts))
test_instances = list(tf.python_io.tf_record_iterator(test_set, opts))
# Load the word embeddings
print('Loading word embeddings...')
lemma_embeddings = lexnet_common.load_word_embeddings(
FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)
# Define the graph and the model
with tf.Graph().as_default():
with tf.variable_scope('lexnet'):
options = tf.python_io.TFRecordOptions(
compression_type=tf.python_io.TFRecordCompressionType.GZIP)
reader = tf.TFRecordReader(options=options)
_, train_instance = reader.read(
tf.train.string_input_producer([train_set]))
shuffled_train_instance = tf.train.shuffle_batch(
[train_instance],
batch_size=1,
num_threads=1,
capacity=len(train_instances),
min_after_dequeue=100,
)[0]
train_model = path_model.PathBasedModel(
hparams, lemma_embeddings, shuffled_train_instance)
with tf.variable_scope('lexnet', reuse=True):
val_instance = tf.placeholder(dtype=tf.string)
val_model = path_model.PathBasedModel(
hparams, lemma_embeddings, val_instance)
# Initialize a session and start training
logdir = (
'{logdir}/results/{dataset}/path/{corpus}/supervisor.logdir'.format(
logdir=FLAGS.logdir, dataset=FLAGS.dataset, corpus=FLAGS.corpus))
best_model_saver = tf.train.Saver()
f1_t = tf.placeholder(tf.float32)
best_f1_t = tf.Variable(0.0, trainable=False, name='best_f1')
assign_best_f1_op = tf.assign(best_f1_t, f1_t)
supervisor = tf.train.Supervisor(
logdir=logdir,
global_step=train_model.global_step)
with supervisor.managed_session() as session:
# Load the labels
print('Loading labels...')
val_labels = train_model.load_labels(session, val_instances)
save_path = '{logdir}/results/{dataset}/path/{corpus}/'.format(
logdir=FLAGS.logdir,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
# Train the model
print('Training the model...')
while True:
step = session.run(train_model.global_step)
epoch = (step + len(train_instances) - 1) // len(train_instances)
if epoch > hparams.num_epochs:
break
print('Starting epoch %d (step %d)...' % (1 + epoch, step))
epoch_loss = train_model.run_one_epoch(session, len(train_instances))
best_f1 = session.run(best_f1_t)
f1 = epoch_completed(val_model, session, epoch, epoch_loss,
val_instances, val_labels, best_model_saver,
save_path, best_f1)
if f1 > best_f1:
session.run(assign_best_f1_op, {f1_t: f1})
if f1 < best_f1 - 0.08:
        tf.logging.info('Stopping training after %d epochs.\n' % epoch)
break
# Print the best performance on the validation set
best_f1 = session.run(best_f1_t)
print('Best performance on the validation set: F1=%.3f' % best_f1)
# Save the path embeddings
print('Computing the path embeddings...')
instances = train_instances + val_instances + test_instances
path_index, path_vectors = path_model.compute_path_embeddings(
val_model, session, instances)
path_emb_dir = '{dir}/path_embeddings/{dataset}/{corpus}/'.format(
dir=FLAGS.embeddings_base_path,
dataset=FLAGS.dataset,
corpus=FLAGS.corpus)
if not os.path.exists(path_emb_dir):
os.makedirs(path_emb_dir)
path_model.save_path_embeddings(
val_model, path_vectors, path_index, path_emb_dir)
def epoch_completed(model, session, epoch, epoch_loss,
val_instances, val_labels, saver, save_path, best_f1):
"""Runs every time an epoch completes.
  Print the performance on the validation set, and update the saved model if
  its performance is better than the previous best. If the performance dropped
  significantly, tell the training to stop.
Args:
model: The currently trained path-based model.
session: The current TensorFlow session.
epoch: The epoch number.
epoch_loss: The current epoch loss.
val_instances: The validation set instances (evaluation between epochs).
val_labels: The validation set labels (for evaluation between epochs).
    saver: A tf.train.Saver object.
save_path: Where to save the model.
best_f1: the best F1 achieved so far.
Returns:
    The F1 achieved on the validation set.
"""
# Evaluate on the validation set
val_pred = model.predict(session, val_instances)
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
val_labels, val_pred, average='weighted')
print(
'Epoch: %d/%d, Loss: %f, validation set: P: %.3f, R: %.3f, F1: %.3f\n' % (
epoch + 1, model.hparams.num_epochs, epoch_loss,
precision, recall, f1))
if f1 > best_f1:
print('Saving model in: %s' % (save_path + 'best.ckpt'))
saver.save(session, save_path + 'best.ckpt')
print('Model saved in file: %s' % (save_path + 'best.ckpt'))
return f1
if __name__ == '__main__':
tf.app.run(main)
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Common stuff used with LexNET."""
# pylint: disable=bad-whitespace
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import numpy as np
from sklearn import metrics
import tensorflow as tf
# Part of speech tags used in the paths.
POSTAGS = [
'PAD', 'VERB', 'CONJ', 'NOUN', 'PUNCT',
'ADP', 'ADJ', 'DET', 'ADV', 'PART',
'NUM', 'X', 'INTJ', 'SYM',
]
POSTAG_TO_ID = {tag: tid for tid, tag in enumerate(POSTAGS)}
# Dependency labels used in the paths.
DEPLABELS = [
'PAD', 'UNK', 'ROOT', 'abbrev', 'acomp', 'advcl',
'advmod', 'agent', 'amod', 'appos', 'attr', 'aux',
'auxpass', 'cc', 'ccomp', 'complm', 'conj', 'cop',
'csubj', 'csubjpass', 'dep', 'det', 'dobj', 'expl',
'infmod', 'iobj', 'mark', 'mwe', 'nc', 'neg',
'nn', 'npadvmod', 'nsubj', 'nsubjpass', 'num', 'number',
'p', 'parataxis', 'partmod', 'pcomp', 'pobj', 'poss',
'preconj', 'predet', 'prep', 'prepc', 'prt', 'ps',
'purpcl', 'quantmod', 'rcmod', 'ref', 'rel', 'suffix',
'title', 'tmod', 'xcomp', 'xsubj',
]
DEPLABEL_TO_ID = {label: lid for lid, label in enumerate(DEPLABELS)}
# Direction codes used in the paths.
DIRS = '_^V<>'
DIR_TO_ID = {dir: did for did, dir in enumerate(DIRS)}
def load_word_embeddings(word_embeddings_dir, word_embeddings_file):
"""Loads pretrained word embeddings from a binary file and returns the matrix.
Args:
word_embeddings_dir: The directory for the word embeddings.
word_embeddings_file: The pretrained word embeddings text file.
Returns:
The word embeddings matrix
"""
embedding_file = os.path.join(word_embeddings_dir, word_embeddings_file)
vocab_file = os.path.join(
word_embeddings_dir, os.path.dirname(word_embeddings_file), 'vocab.txt')
with open(vocab_file) as f_in:
vocab = [line.strip() for line in f_in]
vocab_size = len(vocab)
print('Embedding file "%s" has %d tokens' % (embedding_file, vocab_size))
with open(embedding_file) as f_in:
embeddings = np.load(f_in)
dim = embeddings.shape[1]
# Four initially random vectors for the special tokens: <PAD>, <UNK>, <X>, <Y>
special_embeddings = np.random.normal(0, 0.1, (4, dim))
embeddings = np.vstack((special_embeddings, embeddings))
embeddings = embeddings.astype(np.float32)
return embeddings
def full_evaluation(model, session, instances, labels, set_name, classes):
"""Prints a full evaluation on the current set.
  Performance (recall, precision and F1), classification report (per-class
  performance), and confusion matrix.
Args:
model: The currently trained path-based model.
session: The current TensorFlow session.
instances: The current set instances.
labels: The current set labels.
set_name: The current set name (train/validation/test).
classes: The class label names.
Returns:
The model's prediction for the given instances.
"""
# Predict the labels
pred = model.predict(session, instances)
# Print the performance
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
labels, pred, average='weighted')
print('%s set: Precision: %.3f, Recall: %.3f, F1: %.3f' % (
set_name, precision, recall, f1))
# Print a classification report
print('%s classification report:' % set_name)
print(metrics.classification_report(labels, pred, target_names=classes))
# Print the confusion matrix
print('%s confusion matrix:' % set_name)
cm = metrics.confusion_matrix(labels, pred, labels=range(len(classes)))
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
print_cm(cm, labels=classes)
return pred
def print_cm(cm, labels):
"""Pretty print for confusion matrices.
From: https://gist.github.com/zachguo/10296432.
Args:
cm: The confusion matrix.
labels: The class names.
"""
columnwidth = 10
empty_cell = ' ' * columnwidth
short_labels = [label[:12].rjust(10, ' ') for label in labels]
# Print header
header = empty_cell + ' '
header += ''.join([' %{0}s '.format(columnwidth) % label
for label in short_labels])
print(header)
# Print rows
for i, label1 in enumerate(short_labels):
row = '%{0}s '.format(columnwidth) % label1[:10]
for j in range(len(short_labels)):
value = int(cm[i, j]) if not np.isnan(cm[i, j]) else 0
cell = ' %{0}d '.format(10) % value
row += cell + ' '
print(row)
def load_all_labels(records):
"""Reads TensorFlow examples from a RecordReader and returns only the labels.
Args:
records: a record list with TensorFlow examples.
Returns:
The labels
"""
curr_features = tf.parse_example(records, {
'rel_id': tf.FixedLenFeature([1], dtype=tf.int64),
})
labels = tf.squeeze(curr_features['rel_id'], [-1])
return labels
def load_all_pairs(records):
"""Reads TensorFlow examples from a RecordReader and returns the word pairs.
Args:
records: a record list with TensorFlow examples.
Returns:
The word pairs
"""
curr_features = tf.parse_example(records, {
'pair': tf.FixedLenFeature([1], dtype=tf.string)
})
word_pairs = curr_features['pair']
return word_pairs
def write_predictions(pairs, labels, predictions, classes, predictions_file):
"""Write the predictions to a file.
Args:
pairs: the word pairs (list of tuple of two strings).
labels: the gold-standard labels for these pairs (array of rel ID).
predictions: the predicted labels for these pairs (array of rel ID).
classes: a list of relation names.
predictions_file: where to save the predictions.
"""
with open(predictions_file, 'w') as f_out:
for pair, label, pred in zip(pairs, labels, predictions):
w1, w2 = pair
f_out.write('\t'.join([w1, w2, classes[label], classes[pred]]) + '\n')
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""The integrated LexNET model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import lexnet_common
import numpy as np
import tensorflow as tf
class LexNETModel(object):
"""The LexNET model for classifying relationships between noun compounds."""
@classmethod
def default_hparams(cls):
"""Returns the default hyper-parameters."""
return tf.contrib.training.HParams(
batch_size=10,
num_classes=37,
num_epochs=30,
input_keep_prob=0.9,
input='integrated', # dist/ dist-nc/ path/ integrated/ integrated-nc
learn_relata=False,
corpus='wiki_gigawords',
random_seed=133, # zero means no random seed
relata_embeddings_file='glove/glove.6B.300d.bin',
nc_embeddings_file='nc_glove/vecs.6B.300d.bin',
path_embeddings_file='path_embeddings/tratz/fine_grained/wiki',
hidden_layers=1,
path_dim=60)
def __init__(self, hparams, relata_embeddings, path_embeddings, nc_embeddings,
path_to_index):
"""Initialize the LexNET classifier.
Args:
hparams: the hyper-parameters.
relata_embeddings: word embeddings for the distributional component.
path_embeddings: embeddings for the paths.
nc_embeddings: noun compound embeddings.
path_to_index: a mapping from string path to an index in the path
embeddings matrix.
"""
self.hparams = hparams
self.path_embeddings = path_embeddings
self.relata_embeddings = relata_embeddings
self.nc_embeddings = nc_embeddings
self.vocab_size, self.relata_dim = 0, 0
self.path_to_index = None
self.path_dim = 0
# Set the random seed
if hparams.random_seed > 0:
tf.set_random_seed(hparams.random_seed)
# Get the vocabulary size and relata dim
if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
self.vocab_size, self.relata_dim = self.relata_embeddings.shape
# Create the mapping from string path to an index in the embeddings matrix
if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
self.path_to_index = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
tf.constant(path_to_index.keys()),
tf.constant(path_to_index.values()),
key_dtype=tf.string, value_dtype=tf.int32), 0)
self.path_dim = self.path_embeddings.shape[1]
# Create the network
self.__create_computation_graph__()
def __create_computation_graph__(self):
"""Initialize the model and define the graph."""
network_input = 0
# Define the network inputs
# Distributional x and y
if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
network_input += 2 * self.relata_dim
self.relata_lookup = tf.get_variable(
'relata_lookup',
initializer=self.relata_embeddings,
dtype=tf.float32,
trainable=self.hparams.learn_relata)
# Path-based
if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
network_input += self.path_dim
self.path_initial_value_t = tf.placeholder(tf.float32, None)
self.path_lookup = tf.get_variable(
name='path_lookup',
dtype=tf.float32,
trainable=False,
shape=self.path_embeddings.shape)
self.initialize_path_op = tf.assign(
self.path_lookup, self.path_initial_value_t, validate_shape=False)
# Distributional noun compound
if self.hparams.input in ['dist-nc', 'integrated-nc']:
network_input += self.relata_dim
self.nc_initial_value_t = tf.placeholder(tf.float32, None)
self.nc_lookup = tf.get_variable(
name='nc_lookup',
dtype=tf.float32,
trainable=False,
shape=self.nc_embeddings.shape)
self.initialize_nc_op = tf.assign(
self.nc_lookup, self.nc_initial_value_t, validate_shape=False)
hidden_dim = network_input // 2
# Define the MLP
if self.hparams.hidden_layers == 0:
self.weights1 = tf.get_variable(
'W1',
shape=[network_input, self.hparams.num_classes],
dtype=tf.float32)
self.bias1 = tf.get_variable(
'b1',
shape=[self.hparams.num_classes],
dtype=tf.float32)
elif self.hparams.hidden_layers == 1:
self.weights1 = tf.get_variable(
'W1',
shape=[network_input, hidden_dim],
dtype=tf.float32)
self.bias1 = tf.get_variable(
'b1',
shape=[hidden_dim],
dtype=tf.float32)
self.weights2 = tf.get_variable(
'W2',
shape=[hidden_dim, self.hparams.num_classes],
dtype=tf.float32)
self.bias2 = tf.get_variable(
'b2',
shape=[self.hparams.num_classes],
dtype=tf.float32)
else:
raise ValueError('Only 0 or 1 hidden layers are supported')
# Define the variables
self.instances = tf.placeholder(dtype=tf.string,
shape=[self.hparams.batch_size])
(self.x_embedding_id,
self.y_embedding_id,
self.nc_embedding_id,
self.path_embedding_id,
self.path_counts,
self.labels) = parse_tensorflow_examples(
self.instances, self.hparams.batch_size, self.path_to_index)
# Create the MLP
self.__mlp__()
self.instances_to_load = tf.placeholder(dtype=tf.string, shape=[None])
self.labels_to_load = lexnet_common.load_all_labels(self.instances_to_load)
self.pairs_to_load = lexnet_common.load_all_pairs(self.instances_to_load)
def load_labels(self, session, instances):
"""Loads the labels for these instances.
Args:
session: The current TensorFlow session,
instances: The instances for which to load the labels.
Returns:
the labels of these instances.
"""
return session.run(self.labels_to_load,
feed_dict={self.instances_to_load: instances})
def load_pairs(self, session, instances):
"""Loads the word pairs for these instances.
Args:
session: The current TensorFlow session,
instances: The instances for which to load the labels.
Returns:
the word pairs of these instances.
"""
word_pairs = session.run(self.pairs_to_load,
feed_dict={self.instances_to_load: instances})
return [pair[0].split('::') for pair in word_pairs]
def __train_single_batch__(self, session, batch_instances):
"""Train a single batch.
Args:
session: The current TensorFlow session.
      batch_instances: TensorFlow examples containing the training instances.
Returns:
The cost for the current batch.
"""
cost, _ = session.run([self.cost, self.train_op],
feed_dict={self.instances: batch_instances})
return cost
def fit(self, session, inputs, on_epoch_completed, val_instances, val_labels,
save_path):
"""Train the model.
Args:
session: The current TensorFlow session.
      inputs: the training instances (serialized TensorFlow examples).
on_epoch_completed: A method to call after each epoch.
val_instances: The validation set instances (evaluation between epochs).
val_labels: The validation set labels (for evaluation between epochs).
save_path: Where to save the model.
"""
for epoch in range(self.hparams.num_epochs):
losses = []
epoch_indices = list(np.random.permutation(len(inputs)))
# If the number of instances doesn't divide by batch_size, enlarge it
# by duplicating training examples
mod = len(epoch_indices) % self.hparams.batch_size
if mod > 0:
epoch_indices.extend([np.random.randint(0, high=len(inputs))] * mod)
# Define the batches
n_batches = len(epoch_indices) // self.hparams.batch_size
for minibatch in range(n_batches):
batch_indices = epoch_indices[minibatch * self.hparams.batch_size:(
minibatch + 1) * self.hparams.batch_size]
batch_instances = [inputs[i] for i in batch_indices]
loss = self.__train_single_batch__(session, batch_instances)
losses.append(loss)
epoch_loss = np.nanmean(losses)
if on_epoch_completed:
should_stop = on_epoch_completed(self, session, epoch, epoch_loss,
val_instances, val_labels, save_path)
if should_stop:
print('Stopping training after %d epochs.' % epoch)
return
def predict(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the train paths, x, y and/or nc vectors
Returns:
The test predictions.
"""
predictions, _ = zip(*self.predict_with_score(session, inputs))
return np.array(predictions)
def predict_with_score(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the test paths, x, y and/or nc vectors
Returns:
The test predictions along with their scores.
"""
test_pred = [0] * len(inputs)
for chunk in xrange(0, len(test_pred), self.hparams.batch_size):
# Initialize the variables with the current batch data
batch_indices = list(
range(chunk, min(chunk + self.hparams.batch_size, len(test_pred))))
# If the batch is too small, add a few other examples
if len(batch_indices) < self.hparams.batch_size:
batch_indices += [0] * (self.hparams.batch_size-len(batch_indices))
batch_instances = [inputs[i] for i in batch_indices]
predictions, scores = session.run(
[self.predictions, self.scores],
feed_dict={self.instances: batch_instances})
for index_in_batch, index_in_dataset in enumerate(batch_indices):
prediction = predictions[index_in_batch]
score = scores[index_in_batch][prediction]
test_pred[index_in_dataset] = (prediction, score)
return test_pred
def __mlp__(self):
"""Performs the MLP operations.
Returns: the prediction object to be computed in a Session
"""
# Define the operations
# Network input
vec_inputs = []
# Distributional component
if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
for emb_id in [self.x_embedding_id, self.y_embedding_id]:
vec_inputs.append(tf.nn.embedding_lookup(self.relata_lookup, emb_id))
# Noun compound component
if self.hparams.input in ['dist-nc', 'integrated-nc']:
vec = tf.nn.embedding_lookup(self.nc_lookup, self.nc_embedding_id)
vec_inputs.append(vec)
# Path-based component
if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
# Get the current paths for each batch instance
self.path_embeddings = tf.nn.embedding_lookup(self.path_lookup,
self.path_embedding_id)
# self.path_embeddings is of shape
# [batch_size, max_path_per_instance, output_dim]
# We need to multiply it by path counts
# ([batch_size, max_path_per_instance]).
# Start by duplicating path_counts along the output_dim axis.
self.path_freq = tf.tile(tf.expand_dims(self.path_counts, -1),
[1, 1, self.path_dim])
# Compute the averaged path vector for each instance.
# First, multiply the path embeddings and frequencies element-wise.
self.weighted = tf.multiply(self.path_freq, self.path_embeddings)
# Second, take the sum to get a tensor of shape [batch_size, output_dim].
self.pair_path_embeddings = tf.reduce_sum(self.weighted, 1)
# Finally, divide by the total number of paths.
# The number of paths for each pair has a shape [batch_size, 1],
# We duplicate it output_dim times along the second axis.
self.num_paths = tf.clip_by_value(
tf.reduce_sum(self.path_counts, 1), 1, np.inf)
self.num_paths = tf.tile(tf.expand_dims(self.num_paths, -1),
[1, self.path_dim])
# And finally, divide pair_path_embeddings by num_paths element-wise.
self.pair_path_embeddings = tf.div(
self.pair_path_embeddings, self.num_paths)
vec_inputs.append(self.pair_path_embeddings)
# Concatenate the inputs and feed to the MLP
self.input_vec = tf.nn.dropout(
tf.concat(vec_inputs, 1),
keep_prob=self.hparams.input_keep_prob)
h = tf.matmul(self.input_vec, self.weights1)
self.output = h
if self.hparams.hidden_layers == 1:
self.output = tf.matmul(tf.nn.tanh(h), self.weights2)
self.scores = self.output
self.predictions = tf.argmax(self.scores, axis=1)
# Define the loss function and the optimization algorithm
self.cross_entropies = tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=self.scores, labels=self.labels)
self.cost = tf.reduce_sum(self.cross_entropies, name='cost')
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.optimizer = tf.train.AdamOptimizer()
self.train_op = self.optimizer.minimize(
self.cost, global_step=self.global_step)
def parse_tensorflow_examples(record, batch_size, path_to_index):
"""Reads TensorFlow examples from a RecordReader.
Args:
record: a record with TensorFlow examples.
batch_size: the number of instances in a minibatch
path_to_index: mapping from string path to index in the embeddings matrix.
Returns:
The word embeddings IDs, paths and counts
"""
features = tf.parse_example(
record, {
'x_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
'y_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
'nc_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
'reprs': tf.FixedLenSequenceFeature(
shape=(), dtype=tf.string, allow_missing=True),
'counts': tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'rel_id': tf.FixedLenFeature([1], dtype=tf.int64)
})
x_embedding_id = tf.squeeze(features['x_embedding_id'], [-1])
y_embedding_id = tf.squeeze(features['y_embedding_id'], [-1])
nc_embedding_id = tf.squeeze(features['nc_embedding_id'], [-1])
labels = tf.squeeze(features['rel_id'], [-1])
path_counts = tf.to_float(tf.reshape(features['counts'], [batch_size, -1]))
path_embedding_id = None
if path_to_index:
path_embedding_id = path_to_index.lookup(features['reprs'])
return (
x_embedding_id, y_embedding_id, nc_embedding_id,
path_embedding_id, path_counts, labels)
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""LexNET Path-based Model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import itertools
import os
import lexnet_common
import numpy as np
import tensorflow as tf
class PathBasedModel(object):
"""The LexNET path-based model for classifying semantic relations."""
@classmethod
def default_hparams(cls):
"""Returns the default hyper-parameters."""
return tf.contrib.training.HParams(
max_path_len=8,
num_classes=37,
num_epochs=30,
input_keep_prob=0.9,
learning_rate=0.001,
learn_lemmas=False,
random_seed=133, # zero means no random seed
lemma_embeddings_file='glove/glove.6B.50d.bin',
num_pos=len(lexnet_common.POSTAGS),
num_dep=len(lexnet_common.DEPLABELS),
num_directions=len(lexnet_common.DIRS),
lemma_dim=50,
pos_dim=4,
dep_dim=5,
dir_dim=1)
def __init__(self, hparams, lemma_embeddings, instance):
"""Initialize the LexNET classifier.
Args:
hparams: the hyper-parameters.
lemma_embeddings: word embeddings for the path-based component.
instance: string tensor containing the input instance
"""
self.hparams = hparams
self.lemma_embeddings = lemma_embeddings
self.instance = instance
self.vocab_size, self.lemma_dim = self.lemma_embeddings.shape
# Set the random seed
if hparams.random_seed > 0:
tf.set_random_seed(hparams.random_seed)
# Create the network
self.__create_computation_graph__()
def __create_computation_graph__(self):
"""Initialize the model and define the graph."""
self.lstm_input_dim = sum([self.hparams.lemma_dim, self.hparams.pos_dim,
self.hparams.dep_dim, self.hparams.dir_dim])
self.lstm_output_dim = self.lstm_input_dim
network_input = self.lstm_output_dim
self.lemma_lookup = tf.get_variable(
'lemma_lookup',
initializer=self.lemma_embeddings,
dtype=tf.float32,
trainable=self.hparams.learn_lemmas)
self.pos_lookup = tf.get_variable(
'pos_lookup',
shape=[self.hparams.num_pos, self.hparams.pos_dim],
dtype=tf.float32)
self.dep_lookup = tf.get_variable(
'dep_lookup',
shape=[self.hparams.num_dep, self.hparams.dep_dim],
dtype=tf.float32)
self.dir_lookup = tf.get_variable(
'dir_lookup',
shape=[self.hparams.num_directions, self.hparams.dir_dim],
dtype=tf.float32)
self.weights1 = tf.get_variable(
'W1',
shape=[network_input, self.hparams.num_classes],
dtype=tf.float32)
self.bias1 = tf.get_variable(
'b1',
shape=[self.hparams.num_classes],
dtype=tf.float32)
# Define the variables
(self.batch_paths,
self.path_counts,
self.seq_lengths,
self.path_strings,
self.batch_labels) = _parse_tensorflow_example(
self.instance, self.hparams.max_path_len, self.hparams.input_keep_prob)
# Create the LSTM
self.__lstm__()
# Create the MLP
self.__mlp__()
self.instances_to_load = tf.placeholder(dtype=tf.string, shape=[None])
self.labels_to_load = lexnet_common.load_all_labels(self.instances_to_load)
def load_labels(self, session, batch_instances):
"""Loads the labels of the current instances.
Args:
session: the current TensorFlow session.
batch_instances: the dataset instances.
Returns:
the labels.
"""
return session.run(self.labels_to_load,
feed_dict={self.instances_to_load: batch_instances})
def run_one_epoch(self, session, num_steps):
"""Train the model.
Args:
session: The current TensorFlow session.
num_steps: The number of steps in each epoch.
Returns:
The mean loss for the epoch.
Raises:
ArithmeticError: if the loss becomes non-finite.
"""
losses = []
for step in range(num_steps):
curr_loss, _ = session.run([self.cost, self.train_op])
if not np.isfinite(curr_loss):
raise ArithmeticError('nan loss at step %d' % step)
losses.append(curr_loss)
return np.mean(losses)
def predict(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the train paths, x, y and/or nc vectors
Returns:
The test predictions.
"""
predictions, _ = zip(*self.predict_with_score(session, inputs))
return np.array(predictions)
def predict_with_score(self, session, inputs):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the test paths, x, y and/or nc vectors
Returns:
The test predictions along with their scores.
"""
test_pred = [0] * len(inputs)
for index, instance in enumerate(inputs):
prediction, scores = session.run(
[self.predictions, self.scores],
feed_dict={self.instance: instance})
test_pred[index] = (prediction, scores[prediction])
return test_pred
def __mlp__(self):
"""Performs the MLP operations.
Returns: the prediction object to be computed in a Session
"""
# Feed the paths to the MLP: path_embeddings is
# [num_batch_paths, output_dim], and when we multiply it by W
# ([output_dim, num_classes]), we get a matrix of class distributions:
# [num_batch_paths, num_classes].
self.distributions = tf.matmul(self.path_embeddings, self.weights1)
# Now, compute weighted average on the class distributions, using the path
# frequency as weights.
# First, reshape path_freq to the same shape of distributions
self.path_freq = tf.tile(tf.expand_dims(self.path_counts, -1),
[1, self.hparams.num_classes])
# Second, multiply the distributions and frequencies element-wise.
self.weighted = tf.multiply(self.path_freq, self.distributions)
# Finally, take the average to get a tensor of shape [1, num_classes].
self.weighted_sum = tf.reduce_sum(self.weighted, 0)
self.num_paths = tf.clip_by_value(tf.reduce_sum(self.path_counts),
1, np.inf)
self.num_paths = tf.tile(tf.expand_dims(self.num_paths, -1),
[self.hparams.num_classes])
self.scores = tf.div(self.weighted_sum, self.num_paths)
self.predictions = tf.argmax(self.scores)
# Define the loss function and the optimization algorithm
self.cross_entropies = tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=self.scores, labels=tf.reduce_mean(self.batch_labels))
self.cost = tf.reduce_sum(self.cross_entropies, name='cost')
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.optimizer = tf.train.AdamOptimizer()
self.train_op = self.optimizer.minimize(self.cost,
global_step=self.global_step)
def __lstm__(self):
"""Defines the LSTM operations.
Returns:
A matrix of path embeddings.
"""
lookup_tables = [self.lemma_lookup, self.pos_lookup,
self.dep_lookup, self.dir_lookup]
# Split the edges to components: list of 4 tensors
# [num_batch_paths, max_path_len, 1]
self.edge_components = tf.split(self.batch_paths, 4, axis=2)
# Look up the components embeddings and concatenate them back together
self.path_matrix = tf.concat([
tf.squeeze(tf.nn.embedding_lookup(lookup_table, component), 2)
for lookup_table, component in
zip(lookup_tables, self.edge_components)
], axis=2)
self.sequence_lengths = tf.reshape(self.seq_lengths, [-1])
# Define the LSTM.
# The input is [num_batch_paths, max_path_len, input_dim].
lstm_cell = tf.contrib.rnn.BasicLSTMCell(self.lstm_output_dim)
# The output is [num_batch_paths, max_path_len, output_dim].
self.lstm_outputs, _ = tf.nn.dynamic_rnn(
lstm_cell, self.path_matrix, dtype=tf.float32,
sequence_length=self.sequence_lengths)
# Slice the last *relevant* output for each instance ->
# [num_batch_paths, output_dim]
self.path_embeddings = _extract_last_relevant(self.lstm_outputs,
self.sequence_lengths)
def _parse_tensorflow_example(record, max_path_len, input_keep_prob):
"""Reads TensorFlow examples from a RecordReader.
Args:
record: a record with TensorFlow example.
max_path_len: the maximum path length.
input_keep_prob: 1 - the word dropout probability
Returns:
The paths and counts
"""
features = tf.parse_single_example(record, {
'lemmas':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'postags':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'deplabels':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'dirs':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'counts':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'pathlens':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.int64, allow_missing=True),
'reprs':
tf.FixedLenSequenceFeature(
shape=(), dtype=tf.string, allow_missing=True),
'rel_id':
tf.FixedLenFeature([], dtype=tf.int64)
})
path_counts = tf.to_float(features['counts'])
seq_lengths = features['pathlens']
# Concatenate the edge components to create a path tensor:
# [max_paths_per_ins, max_path_length, 4]
lemmas = _word_dropout(
tf.reshape(features['lemmas'], [-1, max_path_len]), input_keep_prob)
paths = tf.stack(
[lemmas] + [
tf.reshape(features[f], [-1, max_path_len])
for f in ('postags', 'deplabels', 'dirs')
],
axis=-1)
path_strings = features['reprs']
# Add an empty path to pairs with no paths
paths = tf.cond(
tf.shape(paths)[0] > 0,
lambda: paths,
lambda: tf.zeros([1, max_path_len, 4], dtype=tf.int64))
# Paths are left-padded. We reverse them to make them right-padded.
paths = tf.reverse(paths, axis=[1])
path_counts = tf.cond(
tf.shape(path_counts)[0] > 0,
lambda: path_counts,
lambda: tf.constant([1.0], dtype=tf.float32))
seq_lengths = tf.cond(
tf.shape(seq_lengths)[0] > 0,
lambda: seq_lengths,
lambda: tf.constant([1], dtype=tf.int64))
# Duplicate the label for each path
labels = tf.ones_like(path_counts, dtype=tf.int64) * features['rel_id']
return paths, path_counts, seq_lengths, path_strings, labels
def _extract_last_relevant(output, seq_lengths):
"""Get the last relevant LSTM output cell for each batch instance.
Args:
    output: the LSTM outputs - a tensor with shape
      [num_paths, max_path_len, output_dim].
seq_lengths: the sequences length per instance
Returns:
The last relevant LSTM output cell for each batch instance.
"""
max_length = int(output.get_shape()[1])
path_lengths = tf.clip_by_value(seq_lengths - 1, 0, max_length)
relevant = tf.reduce_sum(tf.multiply(output, tf.expand_dims(
tf.one_hot(path_lengths, max_length), -1)), 1)
return relevant
def _word_dropout(words, input_keep_prob):
"""Drops words with probability 1 - input_keep_prob.
Args:
words: a list of lemmas from the paths.
input_keep_prob: the probability to keep the word.
Returns:
The revised list where some of the words are <UNK>ed.
"""
# Create the mask: (-1) to drop, 1 to keep
prob = tf.random_uniform(tf.shape(words), 0, 1)
condition = tf.less(prob, (1 - input_keep_prob))
mask = tf.where(condition,
tf.negative(tf.ones_like(words)), tf.ones_like(words))
# We need to keep zeros (<PAD>), and change other numbers to 1 (<UNK>)
# if their mask is -1. First, we multiply the mask and the words.
# Zeros will stay zeros, and words to drop will become negative.
# Then, we change negative values to 1.
masked_words = tf.multiply(mask, words)
condition = tf.less(masked_words, 0)
dropped_words = tf.where(condition, tf.ones_like(words), words)
return dropped_words
def compute_path_embeddings(model, session, instances):
"""Compute the path embeddings for all the distinct paths.
Args:
model: The trained path-based model.
session: The current TensorFlow session.
instances: All the train, test and validation instances.
Returns:
The path to ID index and the path embeddings.
"""
# Get an index for each distinct path
path_index = collections.defaultdict(itertools.count(0).next)
path_vectors = {}
for instance in instances:
curr_path_embeddings, curr_path_strings = session.run(
[model.path_embeddings, model.path_strings],
feed_dict={model.instance: instance})
for i, path in enumerate(curr_path_strings):
if not path:
continue
# Set a new/existing index for the path
index = path_index[path]
# Save its vector
path_vectors[index] = curr_path_embeddings[i, :]
print('Number of distinct paths: %d' % len(path_index))
return path_index, path_vectors
def save_path_embeddings(model, path_vectors, path_index, embeddings_base_path):
"""Saves the path embeddings.
Args:
model: The trained path-based model.
path_vectors: The path embeddings.
path_index: A map from path to ID.
embeddings_base_path: The base directory where the embeddings are.
"""
index_range = range(max(path_index.values()) + 1)
path_matrix = [path_vectors[i] for i in index_range]
path_matrix = np.vstack(path_matrix)
# Save the path embeddings
path_vector_filename = os.path.join(
embeddings_base_path, '%d_path_vectors' % model.lstm_output_dim)
with open(path_vector_filename, 'w') as f_out:
np.save(f_out, path_matrix)
index_to_path = {i: p for p, i in path_index.iteritems()}
path_vocab = [index_to_path[i] for i in index_range]
# Save the path vocabulary
path_vocab_filename = os.path.join(
embeddings_base_path, '%d_path_vocab' % model.lstm_output_dim)
with open(path_vocab_filename, 'w') as f_out:
f_out.write('\n'.join(path_vocab))
f_out.write('\n')
print('Saved path embeddings.')
def load_path_embeddings(path_embeddings_dir, path_dim):
"""Loads pretrained path embeddings from a binary file and returns the matrix.
Args:
path_embeddings_dir: The directory for the path embeddings.
path_dim: The dimension of the path embeddings, used as prefix to the
path_vocab and path_vectors files.
Returns:
The path embeddings matrix and the path_to_index dictionary.
"""
prefix = path_embeddings_dir + '/%d' % path_dim + '_'
with open(prefix + 'path_vocab') as f_in:
vocab = f_in.read().splitlines()
vocab_size = len(vocab)
embedding_file = prefix + 'path_vectors'
print('Embedding file "%s" has %d paths' % (embedding_file, vocab_size))
with open(embedding_file) as f_in:
embeddings = np.load(f_in)
path_to_index = {p: i for i, p in enumerate(vocab)}
return embeddings, path_to_index
def get_indicative_paths(model, session, path_index, path_vectors, classes,
save_dir, k=20, threshold=0.8):
"""Gets the most indicative paths for each class.
Args:
model: The trained path-based model.
session: The current TensorFlow session.
path_index: A map from path to ID.
path_vectors: The path embeddings.
classes: The class label names.
save_dir: Where to save the paths.
k: The k for top-k paths.
threshold: The threshold above which to consider paths as indicative.
"""
# Define graph variables for this operation
p_path_embedding = tf.placeholder(dtype=tf.float32,
shape=[1, model.lstm_output_dim])
p_distributions = tf.nn.softmax(tf.matmul(p_path_embedding, model.weights1))
# Treat each path as a pair instance with a single path, and get the
# relation distribution for it. Then, take the top paths for each relation.
# This dictionary contains a relation as a key, and the value is a list of
# tuples of path index and score. A relation r will contain (p, s) if the
# path p is classified to r with a confidence of s.
prediction_per_relation = collections.defaultdict(list)
index_to_path = {i: p for p, i in path_index.iteritems()}
# Predict all the paths
for index in range(len(path_index)):
curr_path_vector = path_vectors[index]
distribution = session.run(p_distributions,
feed_dict={
p_path_embedding: np.reshape(
curr_path_vector,
[1, model.lstm_output_dim])})
distribution = distribution[0, :]
prediction = np.argmax(distribution)
prediction_per_relation[prediction].append(
(index, distribution[prediction]))
if index % 10000 == 0:
print('Classified %d/%d (%3.2f%%) of the paths' % (
index, len(path_index), 100 * index / len(path_index)))
# Retrieve k-best scoring paths for each relation
for relation_index, relation in enumerate(classes):
curr_paths = sorted(prediction_per_relation[relation_index],
key=lambda item: item[1], reverse=True)
above_t = [(p, s) for (p, s) in curr_paths if s >= threshold]
    top_k = curr_paths[:k]
relation_paths = above_t if len(above_t) > len(top_k) else top_k
paths_filename = os.path.join(save_dir, '%s.paths' % relation)
with open(paths_filename, 'w') as f_out:
for index, score in relation_paths:
print('\t'.join([index_to_path[index], str(score)]), file=f_out)
@@ -45,8 +45,6 @@ import sys
import tensorflow as tf
from six.moves import xrange
flags = tf.app.flags
flags.DEFINE_string('input', 'coocurrences.bin', 'Vocabulary file')
@@ -133,7 +131,6 @@ def make_shard_files(coocs, nshards, vocab_sz):
  return (shard_files, row_sums)
def main(_):
  with open(FLAGS.vocab, 'r') as lines:
    orig_vocab_sz = sum(1 for _ in lines)
@@ -196,6 +193,5 @@ def main(_):
  print('done!')
if __name__ == '__main__':
  tf.app.run()