Unverified Commit ab6f78d2 authored by Chris Waterson and committed by GitHub

Add utilities and documentation for preparing training data. (#5399)



* add scripts to generate training data

* add scripts to generate training data

* add missing POS tag mapping

* tweak flags to not rely on directory structure, fix issues with output.

* add blurbs for new programs

* oops, need to initialize the embeddings!

* Misspelled 'info'

* editing
Co-authored-by: Chris Waterson <waterson@google.com>
parent bb6f092a
@@ -33,17 +33,16 @@ Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*. The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g. *part of* versus *composed of* versus
   *purpose*), and generally may consist of tens of classes. You can download
   the dataset used in the paper from
   [here](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz).
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use the word embeddings directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
   that connect the constituents for each noun compound. To generate these,
   you'll need a corpus from which to train this data; we used Wikipedia and the
   [LDC GigaWord5](https://catalog.ldc.upenn.edu/LDC2011T07) corpora.

# Contents
@@ -57,51 +56,125 @@ The following source code is included here:

* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

Also included are utilities for preparing data for training:

* `text_embeddings_to_binary.py` converts a text file containing word embeddings
  into a binary file that is quicker to load.
* `extract_paths.py` finds all the dependency paths that connect words in a
  corpus.
* `sorted_paths_to_examples.py` processes the output of `extract_paths.py` to
  produce summarized training data.

This code (in particular, the utilities used to prepare the data) differs from
the code that was used to prepare data for the paper. Notably, we used a
proprietary dependency parser instead of spaCy, which is used here.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site.
* [SciKit Learn](http://scikit-learn.org/): you can probably just install this
  with `pip install sklearn`.
* [SpaCy](https://spacy.io/): `pip install spacy` ought to do the trick, along
  with the English model.

# Creating the Model
This section describes the steps necessary to create and evaluate the model
described in the paper.

## Generate Path Data

To begin, you need three text files:

1. **Corpus**. This file should contain natural language sentences, written with
   one sentence per line. For purposes of exposition, we'll assume that you
   have English Wikipedia serialized this way in `${HOME}/data/wiki.txt`.
2. **Labeled Noun Compound Pairs**. This file contains (modifier, head, label)
   tuples, tab-separated, with one per line. The *label* represents the
   relationship between the head and the modifier; e.g., if `purpose` is one of
   your labels, you might include `tooth<tab>paste<tab>purpose`.
3. **Word Embeddings**. We used the
   [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings; in
   particular, the 6B token, 300d variant. We'll assume you have this file as
   `${HOME}/data/glove.6B.300d.txt`.

We first process the embeddings from their text format into something that we
can load a little more quickly:

    ./text_embeddings_to_binary.py \
      --input ${HOME}/data/glove.6B.300d.txt \
      --output_vocab ${HOME}/data/vocab.txt \
      --output_npy ${HOME}/data/glove.6B.300d.npy
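As a quick sanity check (optional; not part of the pipeline), you can verify
that the vocabulary and the binary matrix line up. This is just a sketch that
assumes the output filenames used above:

    import os

    import numpy as np

    data = os.path.expanduser('~/data')  # the walkthrough's assumed data directory
    embeddings = np.load(os.path.join(data, 'glove.6B.300d.npy'))
    with open(os.path.join(data, 'vocab.txt')) as fh:
      vocab = fh.read().splitlines()

    # One embedding row per vocabulary entry; 300 dimensions for this GloVe variant.
    assert embeddings.shape == (len(vocab), 300), embeddings.shape
    print('%d embeddings of dimension %d' % embeddings.shape)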
Next, we'll extract all the dependency parse paths connecting our labeled pairs
from the corpus. This process takes a *looooong* time, but is trivially
parallelized using map-reduce if you have access to that technology.

    ./extract_paths.py \
      --corpus ${HOME}/data/wiki.txt \
      --labeled_pairs ${HOME}/data/labeled-pairs.tsv \
      --output ${HOME}/data/paths.tsv
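If you don't have a map-reduce stack handy, one workable alternative is to shard
the corpus and run several copies of `extract_paths.py` locally. The following
is only a sketch under the walkthrough's assumed paths; the shard naming and the
upstream `split` invocation are not part of the released tooling:

    #!/usr/bin/env python
    """Sketch: run extract_paths.py over corpus shards in parallel, then merge."""
    import glob
    import os
    import subprocess

    data = os.path.expanduser('~/data')

    # Assumes the corpus was pre-split, e.g.: split -n l/8 wiki.txt wiki.shard.
    shards = sorted(glob.glob(os.path.join(data, 'wiki.shard.*')))

    procs = []
    for i, shard in enumerate(shards):
      out = os.path.join(data, 'paths.%02d.tsv' % i)
      procs.append(subprocess.Popen([
          './extract_paths.py',
          '--corpus', shard,
          '--labeled_pairs', os.path.join(data, 'labeled-pairs.tsv'),
          '--output', out]))

    for proc in procs:
      proc.wait()

    # Concatenate the per-shard outputs into the single paths.tsv used below.
    with open(os.path.join(data, 'paths.tsv'), 'w') as out_fh:
      for i in range(len(shards)):
        with open(os.path.join(data, 'paths.%02d.tsv' % i)) as in_fh:
          out_fh.write(in_fh.read())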
The resulting `paths.tsv` is a tab-separated file that contains the modifier,
the head, the label, the encoded path, and the sentence from which the path was
drawn. (This last is mostly for sanity checking.) A sample row might look
something like this (where the newlines would actually be tab characters):

    navy
    captain
    owner_emp_use
    <X>/PROPN/dobj/>::enter/VERB/ROOT/^::follow/VERB/advcl/<::in/ADP/prep/<::footstep/NOUN/pobj/<::of/ADP/prep/<::father/NOUN/pobj/<::bover/PROPN/appos/<::<Y>/PROPN/compound/<
    He entered the Royal Navy following in the footsteps of his father Captain John Bover and two of his elder brothers as volunteer aboard HMS Perseus
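Each step of the encoded path packs a lemma, a POS tag, a dependency label, and
a traversal direction into a single `/`-separated token, and steps are joined
with `::` (see `get_path` in `extract_paths.py`). A tiny decoder like the
following (the helper name is ours, purely for illustration) can make the paths
easier to eyeball:

    def decode_path(encoded_path):
      """Splits an encoded path into (lemma, pos, dep, direction) tuples."""
      # Assumes lemmas themselves never contain '/'.
      return [tuple(step.split('/')) for step in encoded_path.split('::')]

    # decode_path('<X>/PROPN/dobj/>::enter/VERB/ROOT/^::...')
    # => [('<X>', 'PROPN', 'dobj', '>'), ('enter', 'VERB', 'ROOT', '^'), ...]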
The paths file must be sorted as follows:

    sort -k1,3 -t$'\t' paths.tsv > sorted.paths.tsv

In particular, rows with the same modifier, head, and label must appear
contiguously.

We next create a file that contains all the relation labels from our original
labeled pairs:

    awk 'BEGIN {FS="\t"} {print $3}' < ${HOME}/data/labeled-pairs.tsv \
      | sort -u > ${HOME}/data/relations.txt

With these in hand, we're ready to produce the train, validation, and test data:

    ./sorted_paths_to_examples.py \
      --input ${HOME}/data/sorted.paths.tsv \
      --vocab ${HOME}/data/vocab.txt \
      --relations ${HOME}/data/relations.txt \
      --splits ${HOME}/data/splits.txt \
      --output_dir ${HOME}/data

Here, `splits.txt` is a file that indicates which "split" (train, test, or
validation) you want each pair to appear in. It should be a tab-separated file
which contains the modifier, the head, and the dataset (`train`, `test`, or
`val`) into which the pair should be placed; e.g.:

    tooth <TAB> paste <TAB> train
    banana <TAB> seat <TAB> test
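If you don't already have a canonical split, one way to make a random
`splits.txt` from the labeled pairs is sketched below; the 80/10/10 proportions
and the fixed seed are just assumptions for this walkthrough, not something the
scripts require:

    #!/usr/bin/env python
    """Sketch: randomly assign each labeled (modifier, head) pair to a split."""
    import os
    import random

    random.seed(0)  # keep the assignment reproducible
    data = os.path.expanduser('~/data')

    with open(os.path.join(data, 'labeled-pairs.tsv')) as fh:
      pairs = sorted({tuple(line.split('\t')[:2]) for line in fh.read().splitlines()})

    with open(os.path.join(data, 'splits.txt'), 'w') as out:
      for mod, head in pairs:
        r = random.random()
        split = 'train' if r < 0.8 else ('val' if r < 0.9 else 'test')
        out.write('%s\t%s\t%s\n' % (mod, head, split))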
`sorted_paths_to_examples.py` will produce a separate file for each dataset
split in the directory specified by `--output_dir`. Each file contains
`tf.train.Example` protocol buffers encoded using the `TFRecord` file format.
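To spot-check what was written (optional; this mirrors how
`learn_path_embeddings.py` reads the records), you can iterate over one of the
GZIP-compressed record files and decode an example:

    import os

    import tensorflow as tf

    opts = tf.python_io.TFRecordOptions(
        compression_type=tf.python_io.TFRecordCompressionType.GZIP)

    path = os.path.expanduser('~/data/train.tfrecs.gz')  # as produced above
    for record in tf.python_io.tf_record_iterator(path, opts):
      example = tf.train.Example.FromString(record)
      print(example.features.feature['pair'].bytes_list.value[0],
            example.features.feature['rel'].bytes_list.value[0])
      break  # just peek at the first record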
## Create Path Embeddings

Now we're ready to train the path embeddings using `learn_path_embeddings.py`:

    ./learn_path_embeddings.py \
      --train ${HOME}/data/train.tfrecs.gz \
      --val ${HOME}/data/val.tfrecs.gz \
      --test ${HOME}/data/test.tfrecs.gz \
      --embeddings ${HOME}/data/glove.6B.300d.npy \
      --relations ${HOME}/data/relations.txt \
      --output_dir ${HOME}/data/path-embeddings \
      --logdir /tmp/learn_path_embeddings

The path embeddings will be placed in the directory specified by `--output_dir`.
## Train classifiers
...
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Extracts the dependency paths that connect labeled (modifier, head) pairs."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import itertools
import sys
import spacy
import tensorflow as tf
tf.flags.DEFINE_string('corpus', '', 'Filename of corpus')
tf.flags.DEFINE_string('labeled_pairs', '', 'Filename of labeled pairs')
tf.flags.DEFINE_string('output', '', 'Filename of output file')
FLAGS = tf.flags.FLAGS
def get_path(mod_token, head_token):
"""Returns the path between a modifier token and a head token."""
# Compute the path from the root to each token.
mod_ancestors = list(reversed(list(mod_token.ancestors)))
head_ancestors = list(reversed(list(head_token.ancestors)))
# If the paths don't start at the same place (odd!) then there is no path at
# all.
if (not mod_ancestors or not head_ancestors
or mod_ancestors[0] != head_ancestors[0]):
return None
# Eject elements from the common path until we reach the first differing
# ancestor.
ix = 1
while (ix < len(mod_ancestors) and ix < len(head_ancestors)
and mod_ancestors[ix] == head_ancestors[ix]):
ix += 1
# Construct the path. TODO: add "satellites", possibly honor sentence
# ordering between modifier and head rather than just always traversing from
# the modifier to the head?
path = ['/'.join(('<X>', mod_token.pos_, mod_token.dep_, '>'))]
path += ['/'.join((tok.lemma_, tok.pos_, tok.dep_, '>'))
for tok in reversed(mod_ancestors[ix:])]
root_token = mod_ancestors[ix - 1]
path += ['/'.join((root_token.lemma_, root_token.pos_, root_token.dep_, '^'))]
path += ['/'.join((tok.lemma_, tok.pos_, tok.dep_, '<'))
for tok in head_ancestors[ix:]]
path += ['/'.join(('<Y>', head_token.pos_, head_token.dep_, '<'))]
return '::'.join(path)
def main(_):
nlp = spacy.load('en_core_web_sm')
# Grab the set of labeled pairs for which we wish to collect paths.
with tf.gfile.GFile(FLAGS.labeled_pairs) as fh:
parts = (l.decode('utf-8').split('\t') for l in fh.read().splitlines())
labeled_pairs = {(mod, head): rel for mod, head, rel in parts}
# Create a mapping from each head to the modifiers that are used with it.
mods_for_head = {
head: set(hm[1] for hm in head_mods)
for head, head_mods in itertools.groupby(
sorted((head, mod) for (mod, head) in labeled_pairs.iterkeys()),
lambda (head, mod): head)}
# Collect all the heads that we know about.
heads = set(mods_for_head.keys())
# For each sentence that contains a (head, modifier) pair that's in our set,
# emit the dependency path that connects the pair.
out_fh = sys.stdout if not FLAGS.output else tf.gfile.GFile(FLAGS.output, 'w')
in_fh = sys.stdin if not FLAGS.corpus else tf.gfile.GFile(FLAGS.corpus)
num_paths = 0
for line, sen in enumerate(in_fh, start=1):
if line % 100 == 0:
print('\rProcessing line %d: %d paths' % (line, num_paths),
end='', file=sys.stderr)
sen = sen.decode('utf-8').strip()
doc = nlp(sen)
for head_token in doc:
head_text = head_token.text.lower()
if head_text in heads:
mods = mods_for_head[head_text]
for mod_token in doc:
mod_text = mod_token.text.lower()
if mod_text in mods:
path = get_path(mod_token, head_token)
if path:
label = labeled_pairs[(mod_text, head_text)]
line = '\t'.join((mod_text, head_text, label, path, sen))
print(line.encode('utf-8'), file=out_fh)
num_paths += 1
out_fh.close()
if __name__ == '__main__':
tf.app.run()
@@ -26,29 +26,13 @@ import path_model
from sklearn import metrics
import tensorflow as tf

tf.flags.DEFINE_string('train', '', 'training dataset, tfrecs')
tf.flags.DEFINE_string('val', '', 'validation dataset, tfrecs')
tf.flags.DEFINE_string('test', '', 'test dataset, tfrecs')
tf.flags.DEFINE_string('embeddings', '', 'embeddings, npy')
tf.flags.DEFINE_string('relations', '', 'file containing relation labels')
tf.flags.DEFINE_string('output_dir', '', 'output directory for path embeddings')
tf.flags.DEFINE_string('logdir', '', 'directory for model training')

FLAGS = tf.flags.FLAGS
@@ -56,37 +40,26 @@ def main(_):
  # Pick up any one-off hyper-parameters.
  hparams = path_model.PathBasedModel.default_hparams()

  with open(FLAGS.relations) as fh:
    relations = fh.read().splitlines()

  hparams.num_classes = len(relations)
  print('Model will predict into %d classes' % hparams.num_classes)

  print('Running with hyper-parameters: {}'.format(hparams))

  # Load the instances
  print('Loading instances...')
  opts = tf.python_io.TFRecordOptions(
      compression_type=tf.python_io.TFRecordCompressionType.GZIP)

  train_instances = list(tf.python_io.tf_record_iterator(FLAGS.train, opts))
  val_instances = list(tf.python_io.tf_record_iterator(FLAGS.val, opts))
  test_instances = list(tf.python_io.tf_record_iterator(FLAGS.test, opts))

  # Load the word embeddings
  print('Loading word embeddings...')
  lemma_embeddings = lexnet_common.load_word_embeddings(FLAGS.embeddings)

  # Define the graph and the model
  with tf.Graph().as_default():
@@ -95,7 +68,7 @@ def main(_):
        compression_type=tf.python_io.TFRecordCompressionType.GZIP)
    reader = tf.TFRecordReader(options=options)
    _, train_instance = reader.read(
        tf.train.string_input_producer([FLAGS.train]))
    shuffled_train_instance = tf.train.shuffle_batch(
        [train_instance],
        batch_size=1,
@@ -113,17 +86,13 @@ def main(_):
        hparams, lemma_embeddings, val_instance)

    # Initialize a session and start training
    best_model_saver = tf.train.Saver()
    f1_t = tf.placeholder(tf.float32)
    best_f1_t = tf.Variable(0.0, trainable=False, name='best_f1')
    assign_best_f1_op = tf.assign(best_f1_t, f1_t)

    supervisor = tf.train.Supervisor(
        logdir=FLAGS.logdir,
        global_step=train_model.global_step)

    with supervisor.managed_session() as session:
@@ -131,11 +100,6 @@ def main(_):
      print('Loading labels...')
      val_labels = train_model.load_labels(session, val_instances)

      # Train the model
      print('Training the model...')
@@ -152,13 +116,13 @@ def main(_):
        best_f1 = session.run(best_f1_t)
        f1 = epoch_completed(val_model, session, epoch, epoch_loss,
                             val_instances, val_labels, best_model_saver,
                             FLAGS.logdir, best_f1)

        if f1 > best_f1:
          session.run(assign_best_f1_op, {f1_t: f1})

        if f1 < best_f1 - 0.08:
          tf.logging.info('Stopping training after %d epochs.\n' % epoch)
          break

      # Print the best performance on the validation set
@@ -170,16 +134,12 @@ def main(_):
      instances = train_instances + val_instances + test_instances
      path_index, path_vectors = path_model.compute_path_embeddings(
          val_model, session, instances)

      if not os.path.exists(FLAGS.output_dir):
        os.makedirs(FLAGS.output_dir)

      path_model.save_path_embeddings(
          val_model, path_vectors, path_index, FLAGS.output_dir)


def epoch_completed(model, session, epoch, epoch_loss,
@@ -214,9 +174,10 @@ def epoch_completed(model, session, epoch, epoch_loss,
                        precision, recall, f1))

  if f1 > best_f1:
    save_filename = os.path.join(save_path, 'best.ckpt')
    print('Saving model in: %s' % save_filename)
    saver.save(session, save_filename)
    print('Model saved in file: %s' % save_filename)

  return f1
...
@@ -55,30 +55,18 @@ DIRS = '_^V<>'
DIR_TO_ID = {dir: did for did, dir in enumerate(DIRS)}


def load_word_embeddings(embedding_filename):
  """Loads pretrained word embeddings from a binary file and returns the matrix.

  Adds the <PAD>, <UNK>, <X>, and <Y> tokens to the beginning of the vocab.

  Args:
    embedding_filename: filename of the binary NPY data

  Returns:
    The word embeddings matrix
  """
  embeddings = np.load(embedding_filename)
  dim = embeddings.shape[1]

  # Four initially random vectors for the special tokens: <PAD>, <UNK>, <X>, <Y>
...
@@ -330,7 +330,7 @@ def _parse_tensorflow_example(record, max_path_len, input_keep_prob):
      lambda: tf.zeros([1, max_path_len, 4], dtype=tf.int64))

  # Paths are left-padded. We reverse them to make them right-padded.
  #paths = tf.reverse(paths, axis=[1])

  path_counts = tf.cond(
      tf.shape(path_counts)[0] > 0,
...
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Takes as input a sorted, tab-separated of paths to produce tf.Examples."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import itertools
import os
import sys
import tensorflow as tf
import lexnet_common
tf.flags.DEFINE_string('input', '', 'tab-separated input data')
tf.flags.DEFINE_string('vocab', '', 'a text file containing lemma vocabulary')
tf.flags.DEFINE_string('relations', '', 'a text file containing the relations')
tf.flags.DEFINE_string('output_dir', '', 'output directory')
tf.flags.DEFINE_string('splits', '', 'text file enumerating splits')
tf.flags.DEFINE_string('default_split', '', 'default split for unlabeled pairs')
tf.flags.DEFINE_string('compression', 'GZIP', 'compression for output records')
tf.flags.DEFINE_integer('max_paths', 100, 'maximum number of paths per record')
tf.flags.DEFINE_integer('max_pathlen', 8, 'maximum path length')
FLAGS = tf.flags.FLAGS
def _int64_features(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _bytes_features(value):
value = [v.encode('utf-8') if isinstance(v, unicode) else v for v in value]
return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))
class CreateExampleFn(object):
def __init__(self):
# Read the vocabulary. N.B. that 0 = PAD, 1 = UNK, 2 = <X>, 3 = <Y>, hence
# the enumeration starting at 4.
with tf.gfile.GFile(FLAGS.vocab) as fh:
self.vocab = {w: ix for ix, w in enumerate(fh.read().splitlines(), start=4)}
self.vocab.update({'<PAD>': 0, '<UNK>': 1, '<X>': 2, '<Y>': 3})
# Read the relations.
with tf.gfile.GFile(FLAGS.relations) as fh:
self.relations = {r: ix for ix, r in enumerate(fh.read().splitlines())}
# Some hackery to map from SpaCy postags to Google's.
lexnet_common.POSTAG_TO_ID['PROPN'] = lexnet_common.POSTAG_TO_ID['NOUN']
lexnet_common.POSTAG_TO_ID['PRON'] = lexnet_common.POSTAG_TO_ID['NOUN']
lexnet_common.POSTAG_TO_ID['CCONJ'] = lexnet_common.POSTAG_TO_ID['CONJ']
#lexnet_common.DEPLABEL_TO_ID['relcl'] = lexnet_common.DEPLABEL_TO_ID['rel']
#lexnet_common.DEPLABEL_TO_ID['compound'] = lexnet_common.DEPLABEL_TO_ID['xcomp']
#lexnet_common.DEPLABEL_TO_ID['oprd'] = lexnet_common.DEPLABEL_TO_ID['UNK']
def __call__(self, mod, head, rel, raw_paths):
# Drop any really long paths.
paths = []
counts = []
for raw, count in raw_paths.most_common(FLAGS.max_paths):
path = raw.split('::')
if len(path) <= FLAGS.max_pathlen:
paths.append(path)
counts.append(count)
if not paths:
return None
# Compute the true length.
pathlens = [len(path) for path in paths]
# Pad each path out to max_pathlen so the LSTM can eat it.
paths = (
itertools.islice(
itertools.chain(path, itertools.repeat('<PAD>/PAD/PAD/_')),
FLAGS.max_pathlen)
for path in paths)
# Split the lemma, POS, dependency label, and direction each into a
# separate feature.
lemmas, postags, deplabels, dirs = zip(
*(part.split('/') for part in itertools.chain(*paths)))
lemmas = [self.vocab.get(lemma, 1) for lemma in lemmas]
postags = [lexnet_common.POSTAG_TO_ID[pos] for pos in postags]
deplabels = [lexnet_common.DEPLABEL_TO_ID.get(dep, 1) for dep in deplabels]
dirs = [lexnet_common.DIR_TO_ID.get(d, 0) for d in dirs]
return tf.train.Example(features=tf.train.Features(feature={
'pair': _bytes_features(['::'.join((mod, head))]),
'rel': _bytes_features([rel]),
'rel_id': _int64_features([self.relations[rel]]),
'reprs': _bytes_features(raw_paths),
'pathlens': _int64_features(pathlens),
'counts': _int64_features(counts),
'lemmas': _int64_features(lemmas),
'dirs': _int64_features(dirs),
'deplabels': _int64_features(deplabels),
'postags': _int64_features(postags),
'x_embedding_id': _int64_features([self.vocab[mod]]),
'y_embedding_id': _int64_features([self.vocab[head]]),
}))
def main(_):
# Read the splits file, if there is one.
assignments = {}
splits = set()
if FLAGS.splits:
with tf.gfile.GFile(FLAGS.splits) as fh:
parts = (line.split('\t') for line in fh.read().splitlines())
assignments = {(mod, head): split for mod, head, split in parts}
splits = set(assignments.itervalues())
if FLAGS.default_split:
default_split = FLAGS.default_split
splits.add(FLAGS.default_split)
elif splits:
default_split = iter(splits).next()
else:
print('Please specify --splits, --default_split, or both', file=sys.stderr)
return 1
last_mod, last_head, last_label = None, None, None
raw_paths = collections.Counter()
# Keep track of pairs we've seen to ensure that we don't get unsorted data.
seen_labeled_pairs = set()
# Set up output compression
compression_type = getattr(
tf.python_io.TFRecordCompressionType, FLAGS.compression)
options = tf.python_io.TFRecordOptions(compression_type=compression_type)
writers = {
split: tf.python_io.TFRecordWriter(
os.path.join(FLAGS.output_dir, '%s.tfrecs.gz' % split),
options=options)
for split in splits}
create_example = CreateExampleFn()
in_fh = sys.stdin if not FLAGS.input else tf.gfile.GFile(FLAGS.input)
for lineno, line in enumerate(in_fh, start=1):
if lineno % 100 == 0:
print('\rProcessed %d lines...' % lineno, end='', file=sys.stderr)
parts = line.decode('utf-8').strip().split('\t')
if len(parts) != 5:
print('Skipping line %d: %d columns (expected 5)' % (
lineno, len(parts)), file=sys.stderr)
continue
mod, head, label, raw_path, source = parts
if mod == last_mod and head == last_head and label == last_label:
raw_paths.update([raw_path])
continue
if last_mod and last_head and last_label and raw_paths:
if (last_mod, last_head, last_label) in seen_labeled_pairs:
print('It looks like the input data is not sorted; ignoring extra '
'record for (%s::%s, %s) at line %d' % (
last_mod, last_head, last_label, lineno))
else:
ex = create_example(last_mod, last_head, last_label, raw_paths)
if ex:
split = assignments.get((last_mod, last_head), default_split)
writers[split].write(ex.SerializeToString())
seen_labeled_pairs.add((last_mod, last_head, last_label))
last_mod, last_head, last_label = mod, head, label
raw_paths = collections.Counter()
if last_mod and last_head and last_label and raw_paths:
ex = create_example(last_mod, last_head, last_label, raw_paths)
if ex:
split = assignments.get((last_mod, last_head), default_split)
writers[split].write(ex.SerializeToString())
for writer in writers.itervalues():
writer.close()
if __name__ == '__main__':
tf.app.run()
#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts a text embedding file into a binary format for quicker loading."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
tf.flags.DEFINE_string('input', '', 'text file containing embeddings')
tf.flags.DEFINE_string('output_vocab', '', 'output file for vocabulary')
tf.flags.DEFINE_string('output_npy', '', 'output file for binary')
FLAGS = tf.flags.FLAGS
def main(_):
vecs = []
vocab = []
with tf.gfile.GFile(FLAGS.input) as fh:
for line in fh:
parts = line.strip().split()
vocab.append(parts[0])
vecs.append([float(x) for x in parts[1:]])
with tf.gfile.GFile(FLAGS.output_vocab, 'w') as fh:
fh.write('\n'.join(vocab))
fh.write('\n')
vecs = np.array(vecs, dtype=np.float32)
np.save(FLAGS.output_npy, vecs, allow_pickle=False)
if __name__ == '__main__':
tf.app.run()