Commit 1f371534, authored Mar 19, 2018 by Chris Waterson

Added lexnet_nc model.

parent 5798262b
Showing 9 changed files with 1887 additions and 0 deletions
CODEOWNERS                                    +1    -0
research/README.md                            +2    -0
research/lexnet_nc/README.md                  +132  -0
research/lexnet_nc/get_indicative_paths.py    +111  -0
research/lexnet_nc/learn_classifier.py        +223  -0
research/lexnet_nc/learn_path_embeddings.py   +225  -0
research/lexnet_nc/lexnet_common.py           +209  -0
research/lexnet_nc/lexnet_model.py            +437  -0
research/lexnet_nc/path_model.py              +547  -0
CODEOWNERS

@@ -17,6 +17,7 @@
 /research/inception/ @shlens @vincentvanhoucke
 /research/learned_optimizer/ @olganw @nirum
 /research/learning_to_remember_rare_events/ @lukaszkaiser @ofirnachum
+/research/lexnet_nc/ @vered1986 @waterson
 /research/lfads/ @jazcollins @susillo
 /research/lm_1b/ @oriolvinyals @panyx0718
 /research/maskgan/ @a-dai
research/README.md

@@ -36,6 +36,8 @@ installation](https://www.tensorflow.org/install).
 -   [inception](inception): deep convolutional networks for computer vision.
 -   [learning_to_remember_rare_events](learning_to_remember_rare_events): a
     large-scale life-long memory module for use in deep learning.
+-   [lexnet_nc](lexnet_nc): a distributed model for noun compound relationship
+    classification.
 -   [lfads](lfads): sequential variational autoencoder for analyzing
     neuroscience data.
 -   [lm_1b](lm_1b): language modeling on the one billion word benchmark.
research/lexnet_nc/README.md 0 → 100644

# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for classifying relationships, specifically applied to classifying the
relationships that hold between noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
   constituents in a large corpus. For example, given a sentence like *This fine
   oil is made from first-press olives*, the dependency path is something like
   `oil <NSUBJPASS made PREP> from POBJ> olive` (see the sketch after this list).
2. The distributional information provided by the individual words; i.e., the
   word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
   embedding of the noun compound in context.
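
A minimal sketch of point (1): each edge of such a dependency path carries four
components, a lemma, a POS tag, a dependency label, and a direction code, which
are mapped to integer IDs with the lookup tables defined in `lexnet_common.py`
below. The lemma vocabulary here is a hypothetical placeholder (the first four
lemma IDs are reserved for the special tokens prepended by
`load_word_embeddings`); the exact edge decomposition of the example path is an
assumption, not taken from this commit.

    # Sketch: one path edge as (lemma, POS tag, dependency label, direction),
    # mapped to IDs with the tables from lexnet_common.py. The lemma vocabulary
    # is a hypothetical placeholder.
    import lexnet_common

    lemma_vocab = {'<PAD>': 0, '<UNK>': 1, '<X>': 2, '<Y>': 3, 'made': 5}
    edge = ('made', 'VERB', 'ROOT', '^')
    edge_ids = (lemma_vocab.get(edge[0], lemma_vocab['<UNK>']),
                lexnet_common.POSTAG_TO_ID[edge[1]],    # 'VERB' -> 1
                lexnet_common.DEPLABEL_TO_ID[edge[2]],  # 'ROOT' -> 2
                lexnet_common.DIR_TO_ID[edge[3]])       # '^'    -> 1
    print(edge_ids)  # (5, 1, 2, 1)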

The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2). The *distributional-nc model* and the *integrated-nc model* each add (3).
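
The correspondence between these variants and the `--input` flag of
`learn_classifier.py` can be read off the feature-loading branches in that
script; a minimal sketch of that mapping (names only, nothing here is used by
the code):

    # Which inputs each --input value uses, following the
    # `if hparams.input in [...]` branches in learn_classifier.py.
    VARIANT_INPUTS = {
        'dist':          ['word embeddings'],
        'dist-nc':       ['word embeddings', 'NC embedding'],
        'path':          ['paths'],
        'integrated':    ['word embeddings', 'paths'],
        'integrated-nc': ['word embeddings', 'paths', 'NC embedding'],
    }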

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*. The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g. *part of* versus *composed of* versus
   *purpose*), and generally may consist of tens of classes.
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use them directly as prediction features.
3. A collection of syntactic dependency parses that connect the constituents of
   each noun compound (required by the path-based model).

At the moment, this repository does not contain the tools for generating this
data, but we will provide references to existing datasets and plan to add tools
to generate the data in the future.
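
The relation inventory is read by the training scripts as a plain `classes.txt`
file, one relation label per line, located under the `--dataset_dir`/`--dataset`
directory (see `learn_classifier.py` and `learn_path_embeddings.py` below). A
minimal sketch, with illustrative label names:

    # Sketch of how the scripts read the relation inventory; the example labels
    # are illustrative, not the actual Tratz inventory.
    with open('datasets/tratz/fine_grained/classes.txt') as f_in:
      classes = f_in.read().splitlines()  # e.g. ['PURPOSE', 'PART_OF', ...]
    num_classes = len(classes)  # the scripts assign this to hparams.num_classes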

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
  model to predict a noun-compound relationship given labeled noun-compounds and
  dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
  on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site.
* [scikit-learn](http://scikit-learn.org/): you can probably just install this
  with `pip install scikit-learn`.

# Creating the Model

This section describes the steps you must follow to reproduce the results in
the paper.

## Generate/Download Path Data

TBD! Our plan is to make available the aggregate path data that was used to
train path embeddings and classifiers; however, this will be released
separately.

## Generate/Download Embedding Data

TBD! While we used the standard GloVe vectors for the relata embeddings, the NC
embeddings were generated separately. Our plan is to make that data available,
but it will be released separately.

## Create Path Embeddings

Create the path embeddings using `learn_path_embeddings.py`. This shell script
fragment will iterate through each dataset, split, and corpus to generate path
embeddings for each:

    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigawords ; do
          python learn_path_embeddings.py \
            --dataset_dir ~/lexnet/datasets \
            --dataset "${DATASET}" \
            --corpus "${SPLIT}/${CORPUS}" \
            --embeddings_base_path ~/lexnet/embeddings \
            --logdir /tmp/learn_path_embeddings
        done
      done
    done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.
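
A minimal sketch of reading these embeddings back, as `get_indicative_paths.py`
and `learn_classifier.py` do; the concrete directory is only an example that
follows the defaults above, and the path dimension is the `PathBasedModel`
default (lemma_dim + pos_dim + dep_dim + dir_dim = 50 + 4 + 5 + 1 = 60):

    # Sketch: load the path embeddings written by learn_path_embeddings.py.
    # The directory shown is an example following the shell fragment above.
    import os
    import path_model

    emb_dir = os.path.expanduser(
        '~/lexnet/embeddings/path_embeddings/tratz/fine_grained/random/wiki_gigawords')
    path_embeddings, path_to_index = path_model.load_path_embeddings(emb_dir, 60)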

## Train classifiers

Train classifiers and evaluate on the validation and test data using the
`learn_classifier.py` script. This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers:

    LOGDIR=/tmp/learn_classifier
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

Each log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.
research/lexnet_nc/get_indicative_paths.py 0 → 100755

#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Extracts paths that are indicative of each relation."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import tensorflow as tf

from . import path_model
from . import lexnet_common

tf.flags.DEFINE_string('dataset_dir', 'datasets', 'Dataset base directory')

tf.flags.DEFINE_string(
    'dataset', 'tratz/fine_grained',
    'Subdirectory containing the corpus directories: '
    'subdirectory of dataset_dir')

tf.flags.DEFINE_string(
    'corpus', 'random/wiki',
    'Subdirectory containing the corpus and split: '
    'subdirectory of dataset_dir/dataset')

tf.flags.DEFINE_string(
    'embeddings_base_path', 'embeddings', 'Embeddings base directory')

tf.flags.DEFINE_string('logdir', 'logdir', 'Directory of model output files')

tf.flags.DEFINE_integer('top_k', 20, 'Number of top paths to extract')

tf.flags.DEFINE_float(
    'threshold', 0.8, 'Threshold above which to consider paths as indicative')

FLAGS = tf.flags.FLAGS


def main(_):
  hparams = path_model.PathBasedModel.default_hparams()

  # First things first. Load the path data.
  path_embeddings_file = 'path_embeddings/{dataset}/{corpus}'.format(
      dataset=FLAGS.dataset,
      corpus=FLAGS.corpus)

  path_dim = (hparams.lemma_dim + hparams.pos_dim +
              hparams.dep_dim + hparams.dir_dim)

  path_embeddings, path_to_index = path_model.load_path_embeddings(
      os.path.join(FLAGS.embeddings_base_path, path_embeddings_file),
      path_dim)

  # Load and count the classes so we can correctly instantiate the model.
  classes_filename = os.path.join(
      FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')

  with open(classes_filename) as f_in:
    classes = f_in.read().splitlines()

  hparams.num_classes = len(classes)

  # We need the word embeddings to instantiate the model, too.
  print('Loading word embeddings...')
  lemma_embeddings = lexnet_common.load_word_embeddings(
      FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)

  # Instantiate the model.
  with tf.Graph().as_default():
    with tf.variable_scope('lexnet'):
      instance = tf.placeholder(dtype=tf.string)
      model = path_model.PathBasedModel(
          hparams, lemma_embeddings, instance)

    with tf.Session() as session:
      model_dir = '{logdir}/results/{dataset}/path/{corpus}'.format(
          logdir=FLAGS.logdir,
          dataset=FLAGS.dataset,
          corpus=FLAGS.corpus)

      saver = tf.train.Saver()
      saver.restore(session, os.path.join(model_dir, 'best.ckpt'))

      path_model.get_indicative_paths(
          model, session, path_to_index, path_embeddings, classes,
          model_dir, FLAGS.top_k, FLAGS.threshold)


if __name__ == '__main__':
  tf.app.run()
research/lexnet_nc/learn_classifier.py 0 → 100755

#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Trains the integrated LexNET classifier."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import lexnet_common
import lexnet_model
import path_model
from sklearn import metrics
import tensorflow as tf

tf.flags.DEFINE_string('dataset_dir', 'datasets', 'Dataset base directory')

tf.flags.DEFINE_string(
    'dataset', 'tratz/fine_grained',
    'Subdirectory containing the corpus directories: '
    'subdirectory of dataset_dir')

tf.flags.DEFINE_string(
    'corpus', 'wiki/random',
    'Subdirectory containing the corpus and split: '
    'subdirectory of dataset_dir/dataset')

tf.flags.DEFINE_string(
    'embeddings_base_path', 'embeddings', 'Embeddings base directory')

tf.flags.DEFINE_string('logdir', 'logdir', 'Directory of model output files')

tf.flags.DEFINE_string('hparams', '', 'Hyper-parameters')

tf.flags.DEFINE_string(
    'input', 'integrated',
    'The model (dist/dist-nc/path/integrated/integrated-nc)')

FLAGS = tf.flags.FLAGS


def main(_):
  # Pick up any one-off hyper-parameters.
  hparams = lexnet_model.LexNETModel.default_hparams()
  hparams.corpus = FLAGS.corpus
  hparams.input = FLAGS.input
  hparams.path_embeddings_file = 'path_embeddings/%s/%s' % (
      FLAGS.dataset, FLAGS.corpus)

  input_dir = hparams.input if hparams.input != 'path' else 'path_classifier'

  # Set the number of classes
  classes_filename = os.path.join(
      FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
  with open(classes_filename) as f_in:
    classes = f_in.read().splitlines()

  hparams.num_classes = len(classes)
  print('Model will predict into %d classes' % hparams.num_classes)

  # Get the datasets
  train_set, val_set, test_set = (
      os.path.join(FLAGS.dataset_dir, FLAGS.dataset, FLAGS.corpus,
                   filename + '.tfrecs.gz')
      for filename in ['train', 'val', 'test'])

  print('Running with hyper-parameters: {}'.format(hparams))

  # Load the instances
  print('Loading instances...')
  opts = tf.python_io.TFRecordOptions(
      compression_type=tf.python_io.TFRecordCompressionType.GZIP)
  train_instances = list(tf.python_io.tf_record_iterator(train_set, opts))
  val_instances = list(tf.python_io.tf_record_iterator(val_set, opts))
  test_instances = list(tf.python_io.tf_record_iterator(test_set, opts))

  # Load the word embeddings
  print('Loading word embeddings...')
  relata_embeddings, path_embeddings, nc_embeddings, path_to_index = (
      None, None, None, None)
  if hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
    relata_embeddings = lexnet_common.load_word_embeddings(
        FLAGS.embeddings_base_path, hparams.relata_embeddings_file)

  if hparams.input in ['path', 'integrated', 'integrated-nc']:
    path_embeddings, path_to_index = path_model.load_path_embeddings(
        os.path.join(FLAGS.embeddings_base_path, hparams.path_embeddings_file),
        hparams.path_dim)

  if hparams.input in ['dist-nc', 'integrated-nc']:
    nc_embeddings = lexnet_common.load_word_embeddings(
        FLAGS.embeddings_base_path, hparams.nc_embeddings_file)

  # Define the graph and the model
  with tf.Graph().as_default():
    model = lexnet_model.LexNETModel(
        hparams, relata_embeddings, path_embeddings,
        nc_embeddings, path_to_index)

    # Initialize a session and start training
    session = tf.Session()
    session.run(tf.global_variables_initializer())

    # Initalize the path mapping
    if hparams.input in ['path', 'integrated', 'integrated-nc']:
      session.run(tf.tables_initializer())
      session.run(model.initialize_path_op,
                  {model.path_initial_value_t: path_embeddings})

    # Initialize the NC embeddings
    if hparams.input in ['dist-nc', 'integrated-nc']:
      session.run(model.initialize_nc_op,
                  {model.nc_initial_value_t: nc_embeddings})

    # Load the labels
    print('Loading labels...')
    train_labels = model.load_labels(session, train_instances)
    val_labels = model.load_labels(session, val_instances)
    test_labels = model.load_labels(session, test_instances)

    save_path = '{logdir}/results/{dataset}/{input}/{corpus}'.format(
        logdir=FLAGS.logdir, dataset=FLAGS.dataset,
        corpus=model.hparams.corpus, input=input_dir)

    if not os.path.exists(save_path):
      os.makedirs(save_path)

    # Train the model
    print('Training the model...')
    model.fit(session, train_instances, epoch_completed,
              val_instances, val_labels, save_path)

    # Print the best performance on the validation set
    print('Best performance on the validation set: F1=%.3f' %
          epoch_completed.best_f1)

    # Evaluate on the train and validation sets
    lexnet_common.full_evaluation(model, session, train_instances, train_labels,
                                  'Train', classes)
    lexnet_common.full_evaluation(model, session, val_instances, val_labels,
                                  'Validation', classes)
    test_predictions = lexnet_common.full_evaluation(
        model, session, test_instances, test_labels, 'Test', classes)

    # Write the test predictions to a file
    predictions_file = os.path.join(save_path, 'test_predictions.tsv')
    print('Saving test predictions to %s' % save_path)
    test_pairs = model.load_pairs(session, test_instances)
    lexnet_common.write_predictions(
        test_pairs, test_labels, test_predictions, classes, predictions_file)


def epoch_completed(model, session, epoch, epoch_loss,
                    val_instances, val_labels, save_path):
  """Runs every time an epoch completes.

  Print the performance on the validation set, and update the saved model if
  its performance is better than on the previous ones. If the performance
  dropped, tell the training to stop.

  Args:
    model: The currently trained path-based model.
    session: The current TensorFlow session.
    epoch: The epoch number.
    epoch_loss: The current epoch loss.
    val_instances: The validation set instances (evaluation between epochs).
    val_labels: The validation set labels (for evaluation between epochs).
    save_path: Where to save the model.

  Returns:
    whether the training should stop.
  """
  stop_training = False

  # Evaluate on the validation set
  val_pred = model.predict(session, val_instances)
  precision, recall, f1, _ = metrics.precision_recall_fscore_support(
      val_labels, val_pred, average='weighted')
  print(
      'Epoch: %d/%d, Loss: %f, validation set: P: %.3f, R: %.3f, F1: %.3f\n' % (
          epoch + 1, model.hparams.num_epochs, epoch_loss,
          precision, recall, f1))

  # If the F1 is much smaller than the previous one, stop training. Else, if
  # it's bigger, save the model.
  if f1 < epoch_completed.best_f1 - 0.08:
    stop_training = True

  if f1 > epoch_completed.best_f1:
    saver = tf.train.Saver()
    checkpoint_filename = os.path.join(save_path, 'best.ckpt')
    print('Saving model in: %s' % checkpoint_filename)
    saver.save(session, checkpoint_filename)
    print('Model saved in file: %s' % checkpoint_filename)
    epoch_completed.best_f1 = f1

  return stop_training

epoch_completed.best_f1 = 0


if __name__ == '__main__':
  tf.app.run(main)
research/lexnet_nc/learn_path_embeddings.py 0 → 100755

#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Trains the LexNET path-based model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import lexnet_common
import path_model
from sklearn import metrics
import tensorflow as tf

tf.flags.DEFINE_string('dataset_dir', 'datasets', 'Dataset base directory')

tf.flags.DEFINE_string(
    'dataset', 'tratz/fine_grained',
    'Subdirectory containing the corpus directories: '
    'subdirectory of dataset_dir')

tf.flags.DEFINE_string(
    'corpus', 'random/wiki_gigawords',
    'Subdirectory containing the corpus and split: '
    'subdirectory of dataset_dir/dataset')

tf.flags.DEFINE_string(
    'embeddings_base_path', 'embeddings', 'Embeddings base directory')

tf.flags.DEFINE_string('logdir', 'logdir', 'Directory of model output files')

FLAGS = tf.flags.FLAGS


def main(_):
  # Pick up any one-off hyper-parameters.
  hparams = path_model.PathBasedModel.default_hparams()

  # Set the number of classes
  classes_filename = os.path.join(
      FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')
  with open(classes_filename) as f_in:
    classes = f_in.read().splitlines()

  hparams.num_classes = len(classes)
  print('Model will predict into %d classes' % hparams.num_classes)

  # Get the datasets
  train_set, val_set, test_set = (
      os.path.join(FLAGS.dataset_dir, FLAGS.dataset, FLAGS.corpus,
                   filename + '.tfrecs.gz')
      for filename in ['train', 'val', 'test'])

  print('Running with hyper-parameters: {}'.format(hparams))

  # Load the instances
  print('Loading instances...')
  opts = tf.python_io.TFRecordOptions(
      compression_type=tf.python_io.TFRecordCompressionType.GZIP)
  train_instances = list(tf.python_io.tf_record_iterator(train_set, opts))
  val_instances = list(tf.python_io.tf_record_iterator(val_set, opts))
  test_instances = list(tf.python_io.tf_record_iterator(test_set, opts))

  # Load the word embeddings
  print('Loading word embeddings...')
  lemma_embeddings = lexnet_common.load_word_embeddings(
      FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)

  # Define the graph and the model
  with tf.Graph().as_default():
    with tf.variable_scope('lexnet'):
      options = tf.python_io.TFRecordOptions(
          compression_type=tf.python_io.TFRecordCompressionType.GZIP)
      reader = tf.TFRecordReader(options=options)
      _, train_instance = reader.read(
          tf.train.string_input_producer([train_set]))
      shuffled_train_instance = tf.train.shuffle_batch(
          [train_instance],
          batch_size=1,
          num_threads=1,
          capacity=len(train_instances),
          min_after_dequeue=100,
      )[0]

      train_model = path_model.PathBasedModel(
          hparams, lemma_embeddings, shuffled_train_instance)

    with tf.variable_scope('lexnet', reuse=True):
      val_instance = tf.placeholder(dtype=tf.string)
      val_model = path_model.PathBasedModel(
          hparams, lemma_embeddings, val_instance)

    # Initialize a session and start training
    logdir = (
        '{logdir}/results/{dataset}/path/{corpus}/supervisor.logdir'.format(
            logdir=FLAGS.logdir, dataset=FLAGS.dataset, corpus=FLAGS.corpus))

    best_model_saver = tf.train.Saver()
    f1_t = tf.placeholder(tf.float32)
    best_f1_t = tf.Variable(0.0, trainable=False, name='best_f1')
    assign_best_f1_op = tf.assign(best_f1_t, f1_t)

    supervisor = tf.train.Supervisor(
        logdir=logdir,
        global_step=train_model.global_step)

    with supervisor.managed_session() as session:
      # Load the labels
      print('Loading labels...')
      val_labels = train_model.load_labels(session, val_instances)

      save_path = '{logdir}/results/{dataset}/path/{corpus}/'.format(
          logdir=FLAGS.logdir, dataset=FLAGS.dataset, corpus=FLAGS.corpus)

      # Train the model
      print('Training the model...')
      while True:
        step = session.run(train_model.global_step)
        epoch = (step + len(train_instances) - 1) // len(train_instances)
        if epoch > hparams.num_epochs:
          break

        print('Starting epoch %d (step %d)...' % (1 + epoch, step))

        epoch_loss = train_model.run_one_epoch(session, len(train_instances))

        best_f1 = session.run(best_f1_t)
        f1 = epoch_completed(val_model, session, epoch, epoch_loss,
                             val_instances, val_labels, best_model_saver,
                             save_path, best_f1)

        if f1 > best_f1:
          session.run(assign_best_f1_op, {f1_t: f1})

        if f1 < best_f1 - 0.08:
          tf.logging.info('Stopping training after %d epochs.\n' % epoch)
          break

      # Print the best performance on the validation set
      best_f1 = session.run(best_f1_t)
      print('Best performance on the validation set: F1=%.3f' % best_f1)

      # Save the path embeddings
      print('Computing the path embeddings...')
      instances = train_instances + val_instances + test_instances
      path_index, path_vectors = path_model.compute_path_embeddings(
          val_model, session, instances)
      path_emb_dir = '{dir}/path_embeddings/{dataset}/{corpus}/'.format(
          dir=FLAGS.embeddings_base_path,
          dataset=FLAGS.dataset,
          corpus=FLAGS.corpus)
      if not os.path.exists(path_emb_dir):
        os.makedirs(path_emb_dir)

      path_model.save_path_embeddings(
          val_model, path_vectors, path_index, path_emb_dir)


def epoch_completed(model, session, epoch, epoch_loss,
                    val_instances, val_labels, saver, save_path, best_f1):
  """Runs every time an epoch completes.

  Print the performance on the validation set, and update the saved model if
  its performance is better than on the previous ones. If the performance
  dropped, tell the training to stop.

  Args:
    model: The currently trained path-based model.
    session: The current TensorFlow session.
    epoch: The epoch number.
    epoch_loss: The current epoch loss.
    val_instances: The validation set instances (evaluation between epochs).
    val_labels: The validation set labels (for evaluation between epochs).
    saver: tf.Saver object.
    save_path: Where to save the model.
    best_f1: the best F1 achieved so far.

  Returns:
    The F1 achieved on the validation set.
  """
  # Evaluate on the validation set
  val_pred = model.predict(session, val_instances)
  precision, recall, f1, _ = metrics.precision_recall_fscore_support(
      val_labels, val_pred, average='weighted')
  print(
      'Epoch: %d/%d, Loss: %f, validation set: P: %.3f, R: %.3f, F1: %.3f\n' % (
          epoch + 1, model.hparams.num_epochs, epoch_loss,
          precision, recall, f1))

  if f1 > best_f1:
    print('Saving model in: %s' % (save_path + 'best.ckpt'))
    saver.save(session, save_path + 'best.ckpt')
    print('Model saved in file: %s' % (save_path + 'best.ckpt'))

  return f1


if __name__ == '__main__':
  tf.app.run(main)
research/lexnet_nc/lexnet_common.py 0 → 100644

# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Common stuff used with LexNET."""
# pylint: disable=bad-whitespace
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import numpy as np
from sklearn import metrics
import tensorflow as tf

# Part of speech tags used in the paths.
POSTAGS = [
    'PAD', 'VERB', 'CONJ', 'NOUN', 'PUNCT',
    'ADP', 'ADJ', 'DET', 'ADV', 'PART',
    'NUM', 'X', 'INTJ', 'SYM',
]

POSTAG_TO_ID = {tag: tid for tid, tag in enumerate(POSTAGS)}

# Dependency labels used in the paths.
DEPLABELS = [
    'PAD', 'UNK', 'ROOT', 'abbrev', 'acomp', 'advcl',
    'advmod', 'agent', 'amod', 'appos', 'attr', 'aux',
    'auxpass', 'cc', 'ccomp', 'complm', 'conj', 'cop',
    'csubj', 'csubjpass', 'dep', 'det', 'dobj', 'expl',
    'infmod', 'iobj', 'mark', 'mwe', 'nc', 'neg',
    'nn', 'npadvmod', 'nsubj', 'nsubjpass', 'num', 'number',
    'p', 'parataxis', 'partmod', 'pcomp', 'pobj', 'poss',
    'preconj', 'predet', 'prep', 'prepc', 'prt', 'ps',
    'purpcl', 'quantmod', 'rcmod', 'ref', 'rel', 'suffix',
    'title', 'tmod', 'xcomp', 'xsubj',
]

DEPLABEL_TO_ID = {label: lid for lid, label in enumerate(DEPLABELS)}

# Direction codes used in the paths.
DIRS = '_^V<>'
DIR_TO_ID = {dir: did for did, dir in enumerate(DIRS)}


def load_word_embeddings(word_embeddings_dir, word_embeddings_file):
  """Loads pretrained word embeddings from a binary file and returns the matrix.

  Args:
    word_embeddings_dir: The directory for the word embeddings.
    word_embeddings_file: The pretrained word embeddings file.

  Returns:
    The word embeddings matrix
  """
  embedding_file = os.path.join(word_embeddings_dir, word_embeddings_file)
  vocab_file = os.path.join(
      word_embeddings_dir, os.path.dirname(word_embeddings_file), 'vocab.txt')

  with open(vocab_file) as f_in:
    vocab = [line.strip() for line in f_in]

  vocab_size = len(vocab)

  print('Embedding file "%s" has %d tokens' % (embedding_file, vocab_size))

  with open(embedding_file) as f_in:
    embeddings = np.load(f_in)

  dim = embeddings.shape[1]

  # Four initially random vectors for the special tokens: <PAD>, <UNK>, <X>, <Y>
  special_embeddings = np.random.normal(0, 0.1, (4, dim))
  embeddings = np.vstack((special_embeddings, embeddings))
  embeddings = embeddings.astype(np.float32)

  return embeddings


def full_evaluation(model, session, instances, labels, set_name, classes):
  """Prints a full evaluation on the current set.

  Performance (recall, precision and F1), classification report (per
  class performance), and confusion matrix.

  Args:
    model: The currently trained path-based model.
    session: The current TensorFlow session.
    instances: The current set instances.
    labels: The current set labels.
    set_name: The current set name (train/validation/test).
    classes: The class label names.

  Returns:
    The model's prediction for the given instances.
  """
  # Predict the labels
  pred = model.predict(session, instances)

  # Print the performance
  precision, recall, f1, _ = metrics.precision_recall_fscore_support(
      labels, pred, average='weighted')

  print('%s set: Precision: %.3f, Recall: %.3f, F1: %.3f' % (
      set_name, precision, recall, f1))

  # Print a classification report
  print('%s classification report:' % set_name)
  print(metrics.classification_report(labels, pred, target_names=classes))

  # Print the confusion matrix
  print('%s confusion matrix:' % set_name)
  cm = metrics.confusion_matrix(labels, pred, labels=range(len(classes)))
  cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
  print_cm(cm, labels=classes)
  return pred


def print_cm(cm, labels):
  """Pretty print for confusion matrices.

  From: https://gist.github.com/zachguo/10296432.

  Args:
    cm: The confusion matrix.
    labels: The class names.
  """
  columnwidth = 10
  empty_cell = ' ' * columnwidth
  short_labels = [label[:12].rjust(10, ' ') for label in labels]

  # Print header
  header = empty_cell + ' '
  header += ''.join([' %{0}s '.format(columnwidth) % label
                     for label in short_labels])
  print(header)

  # Print rows
  for i, label1 in enumerate(short_labels):
    row = '%{0}s '.format(columnwidth) % label1[:10]
    for j in range(len(short_labels)):
      value = int(cm[i, j]) if not np.isnan(cm[i, j]) else 0
      cell = ' %{0}d '.format(10) % value
      row += cell + ' '
    print(row)


def load_all_labels(records):
  """Reads TensorFlow examples from a RecordReader and returns only the labels.

  Args:
    records: a record list with TensorFlow examples.

  Returns:
    The labels
  """
  curr_features = tf.parse_example(records, {
      'rel_id': tf.FixedLenFeature([1], dtype=tf.int64),
  })

  labels = tf.squeeze(curr_features['rel_id'], [-1])
  return labels


def load_all_pairs(records):
  """Reads TensorFlow examples from a RecordReader and returns the word pairs.

  Args:
    records: a record list with TensorFlow examples.

  Returns:
    The word pairs
  """
  curr_features = tf.parse_example(records, {
      'pair': tf.FixedLenFeature([1], dtype=tf.string)
  })

  word_pairs = curr_features['pair']
  return word_pairs


def write_predictions(pairs, labels, predictions, classes, predictions_file):
  """Write the predictions to a file.

  Args:
    pairs: the word pairs (list of tuple of two strings).
    labels: the gold-standard labels for these pairs (array of rel ID).
    predictions: the predicted labels for these pairs (array of rel ID).
    classes: a list of relation names.
    predictions_file: where to save the predictions.
  """
  with open(predictions_file, 'w') as f_out:
    for pair, label, pred in zip(pairs, labels, predictions):
      w1, w2 = pair
      f_out.write('\t'.join([w1, w2, classes[label], classes[pred]]) + '\n')
research/lexnet_nc/lexnet_model.py 0 → 100644

# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""The integrated LexNET model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import lexnet_common
import numpy as np
import tensorflow as tf


class LexNETModel(object):
  """The LexNET model for classifying relationships between noun compounds."""

  @classmethod
  def default_hparams(cls):
    """Returns the default hyper-parameters."""
    return tf.contrib.training.HParams(
        batch_size=10,
        num_classes=37,
        num_epochs=30,
        input_keep_prob=0.9,
        input='integrated',  # dist/ dist-nc/ path/ integrated/ integrated-nc
        learn_relata=False,
        corpus='wiki_gigawords',
        random_seed=133,  # zero means no random seed
        relata_embeddings_file='glove/glove.6B.300d.bin',
        nc_embeddings_file='nc_glove/vecs.6B.300d.bin',
        path_embeddings_file='path_embeddings/tratz/fine_grained/wiki',
        hidden_layers=1,
        path_dim=60)

  def __init__(self, hparams, relata_embeddings, path_embeddings, nc_embeddings,
               path_to_index):
    """Initialize the LexNET classifier.

    Args:
      hparams: the hyper-parameters.
      relata_embeddings: word embeddings for the distributional component.
      path_embeddings: embeddings for the paths.
      nc_embeddings: noun compound embeddings.
      path_to_index: a mapping from string path to an index in the path
        embeddings matrix.
    """
    self.hparams = hparams

    self.path_embeddings = path_embeddings
    self.relata_embeddings = relata_embeddings
    self.nc_embeddings = nc_embeddings

    self.vocab_size, self.relata_dim = 0, 0
    self.path_to_index = None
    self.path_dim = 0

    # Set the random seed
    if hparams.random_seed > 0:
      tf.set_random_seed(hparams.random_seed)

    # Get the vocabulary size and relata dim
    if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
      self.vocab_size, self.relata_dim = self.relata_embeddings.shape

    # Create the mapping from string path to an index in the embeddings matrix
    if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
      self.path_to_index = tf.contrib.lookup.HashTable(
          tf.contrib.lookup.KeyValueTensorInitializer(
              tf.constant(path_to_index.keys()),
              tf.constant(path_to_index.values()),
              key_dtype=tf.string, value_dtype=tf.int32), 0)

      self.path_dim = self.path_embeddings.shape[1]

    # Create the network
    self.__create_computation_graph__()

  def __create_computation_graph__(self):
    """Initialize the model and define the graph."""
    network_input = 0

    # Define the network inputs
    # Distributional x and y
    if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
      network_input += 2 * self.relata_dim
      self.relata_lookup = tf.get_variable(
          'relata_lookup',
          initializer=self.relata_embeddings,
          dtype=tf.float32,
          trainable=self.hparams.learn_relata)

    # Path-based
    if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
      network_input += self.path_dim

      self.path_initial_value_t = tf.placeholder(tf.float32, None)

      self.path_lookup = tf.get_variable(
          name='path_lookup',
          dtype=tf.float32,
          trainable=False,
          shape=self.path_embeddings.shape)

      self.initialize_path_op = tf.assign(
          self.path_lookup, self.path_initial_value_t, validate_shape=False)

    # Distributional noun compound
    if self.hparams.input in ['dist-nc', 'integrated-nc']:
      network_input += self.relata_dim

      self.nc_initial_value_t = tf.placeholder(tf.float32, None)

      self.nc_lookup = tf.get_variable(
          name='nc_lookup',
          dtype=tf.float32,
          trainable=False,
          shape=self.nc_embeddings.shape)

      self.initialize_nc_op = tf.assign(
          self.nc_lookup, self.nc_initial_value_t, validate_shape=False)

    hidden_dim = network_input // 2

    # Define the MLP
    if self.hparams.hidden_layers == 0:
      self.weights1 = tf.get_variable(
          'W1',
          shape=[network_input, self.hparams.num_classes],
          dtype=tf.float32)
      self.bias1 = tf.get_variable(
          'b1',
          shape=[self.hparams.num_classes],
          dtype=tf.float32)
    elif self.hparams.hidden_layers == 1:
      self.weights1 = tf.get_variable(
          'W1',
          shape=[network_input, hidden_dim],
          dtype=tf.float32)
      self.bias1 = tf.get_variable(
          'b1',
          shape=[hidden_dim],
          dtype=tf.float32)

      self.weights2 = tf.get_variable(
          'W2',
          shape=[hidden_dim, self.hparams.num_classes],
          dtype=tf.float32)
      self.bias2 = tf.get_variable(
          'b2',
          shape=[self.hparams.num_classes],
          dtype=tf.float32)
    else:
      raise ValueError('Only 0 or 1 hidden layers are supported')

    # Define the variables
    self.instances = tf.placeholder(dtype=tf.string,
                                    shape=[self.hparams.batch_size])

    (self.x_embedding_id,
     self.y_embedding_id,
     self.nc_embedding_id,
     self.path_embedding_id,
     self.path_counts,
     self.labels) = parse_tensorflow_examples(
         self.instances, self.hparams.batch_size, self.path_to_index)

    # Create the MLP
    self.__mlp__()

    self.instances_to_load = tf.placeholder(dtype=tf.string, shape=[None])
    self.labels_to_load = lexnet_common.load_all_labels(self.instances_to_load)
    self.pairs_to_load = lexnet_common.load_all_pairs(self.instances_to_load)

  def load_labels(self, session, instances):
    """Loads the labels for these instances.

    Args:
      session: The current TensorFlow session,
      instances: The instances for which to load the labels.

    Returns:
      the labels of these instances.
    """
    return session.run(self.labels_to_load,
                       feed_dict={self.instances_to_load: instances})

  def load_pairs(self, session, instances):
    """Loads the word pairs for these instances.

    Args:
      session: The current TensorFlow session,
      instances: The instances for which to load the labels.

    Returns:
      the word pairs of these instances.
    """
    word_pairs = session.run(self.pairs_to_load,
                             feed_dict={self.instances_to_load: instances})
    return [pair[0].split('::') for pair in word_pairs]

  def __train_single_batch__(self, session, batch_instances):
    """Train a single batch.

    Args:
      session: The current TensorFlow session.
      batch_instances: TensorFlow examples containing the training instances

    Returns:
      The cost for the current batch.
    """
    cost, _ = session.run([self.cost, self.train_op],
                          feed_dict={self.instances: batch_instances})

    return cost

  def fit(self, session, inputs, on_epoch_completed, val_instances, val_labels,
          save_path):
    """Train the model.

    Args:
      session: The current TensorFlow session.
      inputs: the training instances (serialized TensorFlow examples).
      on_epoch_completed: A method to call after each epoch.
      val_instances: The validation set instances (evaluation between epochs).
      val_labels: The validation set labels (for evaluation between epochs).
      save_path: Where to save the model.
    """
    for epoch in range(self.hparams.num_epochs):
      losses = []
      epoch_indices = list(np.random.permutation(len(inputs)))

      # If the number of instances doesn't divide by batch_size, enlarge it
      # by duplicating training examples
      mod = len(epoch_indices) % self.hparams.batch_size
      if mod > 0:
        epoch_indices.extend([np.random.randint(0, high=len(inputs))] * mod)

      # Define the batches
      n_batches = len(epoch_indices) // self.hparams.batch_size

      for minibatch in range(n_batches):
        batch_indices = epoch_indices[minibatch * self.hparams.batch_size:(
            minibatch + 1) * self.hparams.batch_size]
        batch_instances = [inputs[i] for i in batch_indices]

        loss = self.__train_single_batch__(session, batch_instances)
        losses.append(loss)

      epoch_loss = np.nanmean(losses)

      if on_epoch_completed:
        should_stop = on_epoch_completed(self, session, epoch, epoch_loss,
                                         val_instances, val_labels, save_path)
        if should_stop:
          print('Stopping training after %d epochs.' % epoch)
          return

  def predict(self, session, inputs):
    """Predict the classification of the test set.

    Args:
      session: The current TensorFlow session.
      inputs: the train paths, x, y and/or nc vectors

    Returns:
      The test predictions.
    """
    predictions, _ = zip(*self.predict_with_score(session, inputs))
    return np.array(predictions)

  def predict_with_score(self, session, inputs):
    """Predict the classification of the test set.

    Args:
      session: The current TensorFlow session.
      inputs: the test paths, x, y and/or nc vectors

    Returns:
      The test predictions along with their scores.
    """
    test_pred = [0] * len(inputs)

    for chunk in xrange(0, len(test_pred), self.hparams.batch_size):
      # Initialize the variables with the current batch data
      batch_indices = list(
          range(chunk, min(chunk + self.hparams.batch_size, len(test_pred))))

      # If the batch is too small, add a few other examples
      if len(batch_indices) < self.hparams.batch_size:
        batch_indices += [0] * (self.hparams.batch_size - len(batch_indices))

      batch_instances = [inputs[i] for i in batch_indices]

      predictions, scores = session.run(
          [self.predictions, self.scores],
          feed_dict={self.instances: batch_instances})

      for index_in_batch, index_in_dataset in enumerate(batch_indices):
        prediction = predictions[index_in_batch]
        score = scores[index_in_batch][prediction]
        test_pred[index_in_dataset] = (prediction, score)

    return test_pred

  def __mlp__(self):
    """Performs the MLP operations.

    Returns: the prediction object to be computed in a Session
    """
    # Define the operations

    # Network input
    vec_inputs = []

    # Distributional component
    if self.hparams.input in ['dist', 'dist-nc', 'integrated', 'integrated-nc']:
      for emb_id in [self.x_embedding_id, self.y_embedding_id]:
        vec_inputs.append(tf.nn.embedding_lookup(self.relata_lookup, emb_id))

    # Noun compound component
    if self.hparams.input in ['dist-nc', 'integrated-nc']:
      vec = tf.nn.embedding_lookup(self.nc_lookup, self.nc_embedding_id)
      vec_inputs.append(vec)

    # Path-based component
    if self.hparams.input in ['path', 'integrated', 'integrated-nc']:
      # Get the current paths for each batch instance
      self.path_embeddings = tf.nn.embedding_lookup(self.path_lookup,
                                                    self.path_embedding_id)

      # self.path_embeddings is of shape
      # [batch_size, max_path_per_instance, output_dim]
      # We need to multiply it by path counts
      # ([batch_size, max_path_per_instance]).
      # Start by duplicating path_counts along the output_dim axis.
      self.path_freq = tf.tile(tf.expand_dims(self.path_counts, -1),
                               [1, 1, self.path_dim])

      # Compute the averaged path vector for each instance.
      # First, multiply the path embeddings and frequencies element-wise.
      self.weighted = tf.multiply(self.path_freq, self.path_embeddings)

      # Second, take the sum to get a tensor of shape [batch_size, output_dim].
      self.pair_path_embeddings = tf.reduce_sum(self.weighted, 1)

      # Finally, divide by the total number of paths.
      # The number of paths for each pair has a shape [batch_size, 1],
      # We duplicate it output_dim times along the second axis.
      self.num_paths = tf.clip_by_value(
          tf.reduce_sum(self.path_counts, 1), 1, np.inf)
      self.num_paths = tf.tile(tf.expand_dims(self.num_paths, -1),
                               [1, self.path_dim])

      # And finally, divide pair_path_embeddings by num_paths element-wise.
      self.pair_path_embeddings = tf.div(
          self.pair_path_embeddings, self.num_paths)
      vec_inputs.append(self.pair_path_embeddings)

    # Concatenate the inputs and feed to the MLP
    self.input_vec = tf.nn.dropout(
        tf.concat(vec_inputs, 1),
        keep_prob=self.hparams.input_keep_prob)

    h = tf.matmul(self.input_vec, self.weights1)
    self.output = h

    if self.hparams.hidden_layers == 1:
      self.output = tf.matmul(tf.nn.tanh(h), self.weights2)

    self.scores = self.output
    self.predictions = tf.argmax(self.scores, axis=1)

    # Define the loss function and the optimization algorithm
    self.cross_entropies = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=self.scores, labels=self.labels)
    self.cost = tf.reduce_sum(self.cross_entropies, name='cost')
    self.global_step = tf.Variable(0, name='global_step', trainable=False)
    self.optimizer = tf.train.AdamOptimizer()
    self.train_op = self.optimizer.minimize(
        self.cost, global_step=self.global_step)


def parse_tensorflow_examples(record, batch_size, path_to_index):
  """Reads TensorFlow examples from a RecordReader.

  Args:
    record: a record with TensorFlow examples.
    batch_size: the number of instances in a minibatch
    path_to_index: mapping from string path to index in the embeddings matrix.

  Returns:
    The word embeddings IDs, paths and counts
  """
  features = tf.parse_example(
      record, {
          'x_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
          'y_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
          'nc_embedding_id': tf.FixedLenFeature([1], dtype=tf.int64),
          'reprs': tf.FixedLenSequenceFeature(
              shape=(), dtype=tf.string, allow_missing=True),
          'counts': tf.FixedLenSequenceFeature(
              shape=(), dtype=tf.int64, allow_missing=True),
          'rel_id': tf.FixedLenFeature([1], dtype=tf.int64)
      })

  x_embedding_id = tf.squeeze(features['x_embedding_id'], [-1])
  y_embedding_id = tf.squeeze(features['y_embedding_id'], [-1])
  nc_embedding_id = tf.squeeze(features['nc_embedding_id'], [-1])
  labels = tf.squeeze(features['rel_id'], [-1])
  path_counts = tf.to_float(tf.reshape(features['counts'], [batch_size, -1]))

  path_embedding_id = None
  if path_to_index:
    path_embedding_id = path_to_index.lookup(features['reprs'])

  return (
      x_embedding_id, y_embedding_id, nc_embedding_id,
      path_embedding_id, path_counts, labels)
research/lexnet_nc/path_model.py
0 → 100644
View file @
1f371534
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""LexNET Path-based Model."""
from
__future__
import
absolute_import
from
__future__
import
division
from
__future__
import
print_function
import
collections
import
itertools
import
os
import
lexnet_common
import
numpy
as
np
import
tensorflow
as
tf
class
PathBasedModel
(
object
):
"""The LexNET path-based model for classifying semantic relations."""
@
classmethod
def
default_hparams
(
cls
):
"""Returns the default hyper-parameters."""
return
tf
.
contrib
.
training
.
HParams
(
max_path_len
=
8
,
num_classes
=
37
,
num_epochs
=
30
,
input_keep_prob
=
0.9
,
learning_rate
=
0.001
,
learn_lemmas
=
False
,
random_seed
=
133
,
# zero means no random seed
lemma_embeddings_file
=
'glove/glove.6B.50d.bin'
,
num_pos
=
len
(
lexnet_common
.
POSTAGS
),
num_dep
=
len
(
lexnet_common
.
DEPLABELS
),
num_directions
=
len
(
lexnet_common
.
DIRS
),
lemma_dim
=
50
,
pos_dim
=
4
,
dep_dim
=
5
,
dir_dim
=
1
)
def
__init__
(
self
,
hparams
,
lemma_embeddings
,
instance
):
"""Initialize the LexNET classifier.
Args:
hparams: the hyper-parameters.
lemma_embeddings: word embeddings for the path-based component.
instance: string tensor containing the input instance
"""
self
.
hparams
=
hparams
self
.
lemma_embeddings
=
lemma_embeddings
self
.
instance
=
instance
self
.
vocab_size
,
self
.
lemma_dim
=
self
.
lemma_embeddings
.
shape
# Set the random seed
if
hparams
.
random_seed
>
0
:
tf
.
set_random_seed
(
hparams
.
random_seed
)
# Create the network
self
.
__create_computation_graph__
()
def
__create_computation_graph__
(
self
):
"""Initialize the model and define the graph."""
self
.
lstm_input_dim
=
sum
([
self
.
hparams
.
lemma_dim
,
self
.
hparams
.
pos_dim
,
self
.
hparams
.
dep_dim
,
self
.
hparams
.
dir_dim
])
self
.
lstm_output_dim
=
self
.
lstm_input_dim
network_input
=
self
.
lstm_output_dim
self
.
lemma_lookup
=
tf
.
get_variable
(
'lemma_lookup'
,
initializer
=
self
.
lemma_embeddings
,
dtype
=
tf
.
float32
,
trainable
=
self
.
hparams
.
learn_lemmas
)
self
.
pos_lookup
=
tf
.
get_variable
(
'pos_lookup'
,
shape
=
[
self
.
hparams
.
num_pos
,
self
.
hparams
.
pos_dim
],
dtype
=
tf
.
float32
)
self
.
dep_lookup
=
tf
.
get_variable
(
'dep_lookup'
,
shape
=
[
self
.
hparams
.
num_dep
,
self
.
hparams
.
dep_dim
],
dtype
=
tf
.
float32
)
self
.
dir_lookup
=
tf
.
get_variable
(
'dir_lookup'
,
shape
=
[
self
.
hparams
.
num_directions
,
self
.
hparams
.
dir_dim
],
dtype
=
tf
.
float32
)
self
.
weights1
=
tf
.
get_variable
(
'W1'
,
shape
=
[
network_input
,
self
.
hparams
.
num_classes
],
dtype
=
tf
.
float32
)
self
.
bias1
=
tf
.
get_variable
(
'b1'
,
shape
=
[
self
.
hparams
.
num_classes
],
dtype
=
tf
.
float32
)
# Define the variables
(
self
.
batch_paths
,
self
.
path_counts
,
self
.
seq_lengths
,
self
.
path_strings
,
self
.
batch_labels
)
=
_parse_tensorflow_example
(
self
.
instance
,
self
.
hparams
.
max_path_len
,
self
.
hparams
.
input_keep_prob
)
# Create the LSTM
self
.
__lstm__
()
# Create the MLP
self
.
__mlp__
()
self
.
instances_to_load
=
tf
.
placeholder
(
dtype
=
tf
.
string
,
shape
=
[
None
])
self
.
labels_to_load
=
lexnet_common
.
load_all_labels
(
self
.
instances_to_load
)
def
load_labels
(
self
,
session
,
batch_instances
):
"""Loads the labels of the current instances.
Args:
session: the current TensorFlow session.
batch_instances: the dataset instances.
Returns:
the labels.
"""
return
session
.
run
(
self
.
labels_to_load
,
feed_dict
=
{
self
.
instances_to_load
:
batch_instances
})
def
run_one_epoch
(
self
,
session
,
num_steps
):
"""Train the model.
Args:
session: The current TensorFlow session.
num_steps: The number of steps in each epoch.
Returns:
The mean loss for the epoch.
Raises:
ArithmeticError: if the loss becomes non-finite.
"""
losses
=
[]
for
step
in
range
(
num_steps
):
curr_loss
,
_
=
session
.
run
([
self
.
cost
,
self
.
train_op
])
if
not
np
.
isfinite
(
curr_loss
):
raise
ArithmeticError
(
'nan loss at step %d'
%
step
)
losses
.
append
(
curr_loss
)
return
np
.
mean
(
losses
)
def
predict
(
self
,
session
,
inputs
):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the train paths, x, y and/or nc vectors
Returns:
The test predictions.
"""
predictions
,
_
=
zip
(
*
self
.
predict_with_score
(
session
,
inputs
))
return
np
.
array
(
predictions
)
def
predict_with_score
(
self
,
session
,
inputs
):
"""Predict the classification of the test set.
Args:
session: The current TensorFlow session.
inputs: the test paths, x, y and/or nc vectors
Returns:
The test predictions along with their scores.
"""
test_pred
=
[
0
]
*
len
(
inputs
)
for
index
,
instance
in
enumerate
(
inputs
):
prediction
,
scores
=
session
.
run
(
[
self
.
predictions
,
self
.
scores
],
feed_dict
=
{
self
.
instance
:
instance
})
test_pred
[
index
]
=
(
prediction
,
scores
[
prediction
])
return
test_pred
def
__mlp__
(
self
):
"""Performs the MLP operations.
Returns: the prediction object to be computed in a Session
"""
# Feed the paths to the MLP: path_embeddings is
# [num_batch_paths, output_dim], and when we multiply it by W
# ([output_dim, num_classes]), we get a matrix of class distributions:
# [num_batch_paths, num_classes].
self
.
distributions
=
tf
.
matmul
(
self
.
path_embeddings
,
self
.
weights1
)
# Now, compute weighted average on the class distributions, using the path
# frequency as weights.
# First, reshape path_freq to the same shape of distributions
self
.
path_freq
=
tf
.
tile
(
tf
.
expand_dims
(
self
.
path_counts
,
-
1
),
[
1
,
self
.
hparams
.
num_classes
])
# Second, multiply the distributions and frequencies element-wise.
self
.
weighted
=
tf
.
multiply
(
self
.
path_freq
,
self
.
distributions
)
# Finally, take the average to get a tensor of shape [1, num_classes].
self
.
weighted_sum
=
tf
.
reduce_sum
(
self
.
weighted
,
0
)
self
.
num_paths
=
tf
.
clip_by_value
(
tf
.
reduce_sum
(
self
.
path_counts
),
1
,
np
.
inf
)
self
.
num_paths
=
tf
.
tile
(
tf
.
expand_dims
(
self
.
num_paths
,
-
1
),
[
self
.
hparams
.
num_classes
])
self
.
scores
=
tf
.
div
(
self
.
weighted_sum
,
self
.
num_paths
)
self
.
predictions
=
tf
.
argmax
(
self
.
scores
)
# Define the loss function and the optimization algorithm
self
.
cross_entropies
=
tf
.
nn
.
sparse_softmax_cross_entropy_with_logits
(
logits
=
self
.
scores
,
labels
=
tf
.
reduce_mean
(
self
.
batch_labels
))
self
.
cost
=
tf
.
reduce_sum
(
self
.
cross_entropies
,
name
=
'cost'
)
self
.
global_step
=
tf
.
Variable
(
0
,
name
=
'global_step'
,
trainable
=
False
)
self
.
optimizer
=
tf
.
train
.
AdamOptimizer
()
self
.
train_op
=
self
.
optimizer
.
minimize
(
self
.
cost
,
global_step
=
self
.
global_step
)
def
__lstm__
(
self
):
"""Defines the LSTM operations.
Returns:
A matrix of path embeddings.
"""
lookup_tables
=
[
self
.
lemma_lookup
,
self
.
pos_lookup
,
self
.
dep_lookup
,
self
.
dir_lookup
]
# Split the edges to components: list of 4 tensors
# [num_batch_paths, max_path_len, 1]
self
.
edge_components
=
tf
.
split
(
self
.
batch_paths
,
4
,
axis
=
2
)
# Look up the components embeddings and concatenate them back together
self
.
path_matrix
=
tf
.
concat
([
tf
.
squeeze
(
tf
.
nn
.
embedding_lookup
(
lookup_table
,
component
),
2
)
for
lookup_table
,
component
in
zip
(
lookup_tables
,
self
.
edge_components
)
],
axis
=
2
)
self
.
sequence_lengths
=
tf
.
reshape
(
self
.
seq_lengths
,
[
-
1
])
# Define the LSTM.
# The input is [num_batch_paths, max_path_len, input_dim].
lstm_cell
=
tf
.
contrib
.
rnn
.
BasicLSTMCell
(
self
.
lstm_output_dim
)
# The output is [num_batch_paths, max_path_len, output_dim].
self
.
lstm_outputs
,
_
=
tf
.
nn
.
dynamic_rnn
(
lstm_cell
,
self
.
path_matrix
,
dtype
=
tf
.
float32
,
sequence_length
=
self
.
sequence_lengths
)
# Slice the last *relevant* output for each instance ->
# [num_batch_paths, output_dim]
self
.
path_embeddings
=
_extract_last_relevant
(
self
.
lstm_outputs
,
self
.
sequence_lengths
)


def _parse_tensorflow_example(record, max_path_len, input_keep_prob):
  """Reads TensorFlow examples from a RecordReader.

  Args:
    record: a record with TensorFlow example.
    max_path_len: the maximum path length.
    input_keep_prob: 1 - the word dropout probability

  Returns:
    The paths and counts
  """
  features = tf.parse_single_example(record, {
      'lemmas': tf.FixedLenSequenceFeature(
          shape=(), dtype=tf.int64, allow_missing=True),
      'postags': tf.FixedLenSequenceFeature(
          shape=(), dtype=tf.int64, allow_missing=True),
      'deplabels': tf.FixedLenSequenceFeature(
          shape=(), dtype=tf.int64, allow_missing=True),
      'dirs': tf.FixedLenSequenceFeature(
          shape=(), dtype=tf.int64, allow_missing=True),
      'counts': tf.FixedLenSequenceFeature(
          shape=(), dtype=tf.int64, allow_missing=True),
      'pathlens': tf.FixedLenSequenceFeature(
          shape=(), dtype=tf.int64, allow_missing=True),
      'reprs': tf.FixedLenSequenceFeature(
          shape=(), dtype=tf.string, allow_missing=True),
      'rel_id': tf.FixedLenFeature([], dtype=tf.int64)
  })

  path_counts = tf.to_float(features['counts'])
  seq_lengths = features['pathlens']

  # Concatenate the edge components to create a path tensor:
  # [max_paths_per_ins, max_path_length, 4]
  lemmas = _word_dropout(
      tf.reshape(features['lemmas'], [-1, max_path_len]), input_keep_prob)

  paths = tf.stack(
      [lemmas] + [tf.reshape(features[f], [-1, max_path_len])
                  for f in ('postags', 'deplabels', 'dirs')],
      axis=-1)

  path_strings = features['reprs']

  # Add an empty path to pairs with no paths
  paths = tf.cond(
      tf.shape(paths)[0] > 0,
      lambda: paths,
      lambda: tf.zeros([1, max_path_len, 4], dtype=tf.int64))

  # Paths are left-padded. We reverse them to make them right-padded.
  paths = tf.reverse(paths, axis=[1])

  path_counts = tf.cond(
      tf.shape(path_counts)[0] > 0,
      lambda: path_counts,
      lambda: tf.constant([1.0], dtype=tf.float32))

  seq_lengths = tf.cond(
      tf.shape(seq_lengths)[0] > 0,
      lambda: seq_lengths,
      lambda: tf.constant([1], dtype=tf.int64))

  # Duplicate the label for each path
  labels = tf.ones_like(path_counts, dtype=tf.int64) * features['rel_id']

  return paths, path_counts, seq_lengths, path_strings, labels
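
# Illustrative sketch, not part of the original file: one plausible way to use
# _parse_tensorflow_example is as the map function of a tf.data input pipeline
# (the actual wiring may differ elsewhere in this file), e.g.
#
#   dataset = tf.data.TFRecordDataset([train_filename])
#   dataset = dataset.map(
#       lambda record: _parse_tensorflow_example(
#           record, hparams.max_path_len, hparams.input_keep_prob))
#
# where `train_filename` and `hparams` are assumed names.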


def _extract_last_relevant(output, seq_lengths):
  """Gets the last relevant LSTM output cell for each batch instance.

  Args:
    output: the LSTM outputs - a tensor with shape
      [num_paths, max_path_len, output_dim]
    seq_lengths: the sequence lengths per instance

  Returns:
    The last relevant LSTM output cell for each batch instance.
  """
  max_length = int(output.get_shape()[1])
  path_lengths = tf.clip_by_value(seq_lengths - 1, 0, max_length)
  relevant = tf.reduce_sum(
      tf.multiply(output,
                  tf.expand_dims(tf.one_hot(path_lengths, max_length), -1)),
      1)
  return relevant
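
# Worked example for _extract_last_relevant (illustrative, not part of the
# original file): for a single path with max_length = 4 and seq_lengths = [3],
# path_lengths becomes [2], tf.one_hot([2], 4) is [[0., 0., 1., 0.]], and the
# masked reduce_sum over the time axis returns the LSTM output at step index 2,
# i.e. the last step that corresponds to a real (non-padded) edge.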


def _word_dropout(words, input_keep_prob):
  """Drops words with probability 1 - input_keep_prob.

  Args:
    words: a list of lemmas from the paths.
    input_keep_prob: the probability to keep the word.

  Returns:
    The revised list where some of the words are <UNK>ed.
  """
  # Create the mask: (-1) to drop, 1 to keep
  prob = tf.random_uniform(tf.shape(words), 0, 1)
  condition = tf.less(prob, (1 - input_keep_prob))
  mask = tf.where(condition,
                  tf.negative(tf.ones_like(words)), tf.ones_like(words))

  # We need to keep zeros (<PAD>), and change other numbers to 1 (<UNK>)
  # if their mask is -1. First, we multiply the mask and the words.
  # Zeros will stay zeros, and words to drop will become negative.
  # Then, we change negative values to 1.
  masked_words = tf.multiply(mask, words)
  condition = tf.less(masked_words, 0)
  dropped_words = tf.where(condition, tf.ones_like(words), words)
  return dropped_words
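
# Behaviour sketch for _word_dropout (illustrative, not part of the original
# file): given words = [[0, 0, 5, 12]] (0 is <PAD>, 1 is <UNK>) and a random
# draw that selects only the last position for dropout, the mask is
# [[1, 1, 1, -1]], the element-wise product is [[0, 0, 5, -12]], and the
# negative entry is replaced by 1, yielding [[0, 0, 5, 1]]: padding stays 0
# and the dropped lemma becomes <UNK>.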


def compute_path_embeddings(model, session, instances):
  """Compute the path embeddings for all the distinct paths.

  Args:
    model: The trained path-based model.
    session: The current TensorFlow session.
    instances: All the train, test and validation instances.

  Returns:
    The path to ID index and the path embeddings.
  """
  # Get an index for each distinct path
  path_index = collections.defaultdict(itertools.count(0).next)
  path_vectors = {}

  for instance in instances:
    curr_path_embeddings, curr_path_strings = session.run(
        [model.path_embeddings, model.path_strings],
        feed_dict={model.instance: instance})

    for i, path in enumerate(curr_path_strings):
      if not path:
        continue

      # Set a new/existing index for the path
      index = path_index[path]

      # Save its vector
      path_vectors[index] = curr_path_embeddings[i, :]

  print('Number of distinct paths: %d' % len(path_index))
  return path_index, path_vectors


def save_path_embeddings(model, path_vectors, path_index, embeddings_base_path):
  """Saves the path embeddings.

  Args:
    model: The trained path-based model.
    path_vectors: The path embeddings.
    path_index: A map from path to ID.
    embeddings_base_path: The base directory where the embeddings are.
  """
  index_range = range(max(path_index.values()) + 1)
  path_matrix = [path_vectors[i] for i in index_range]
  path_matrix = np.vstack(path_matrix)

  # Save the path embeddings
  path_vector_filename = os.path.join(
      embeddings_base_path, '%d_path_vectors' % model.lstm_output_dim)
  with open(path_vector_filename, 'w') as f_out:
    np.save(f_out, path_matrix)

  index_to_path = {i: p for p, i in path_index.iteritems()}
  path_vocab = [index_to_path[i] for i in index_range]

  # Save the path vocabulary
  path_vocab_filename = os.path.join(
      embeddings_base_path, '%d_path_vocab' % model.lstm_output_dim)
  with open(path_vocab_filename, 'w') as f_out:
    f_out.write('\n'.join(path_vocab))
    f_out.write('\n')

  print('Saved path embeddings.')
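
# Usage sketch (illustrative, not part of the original file): after training
# the path-based model, the two helpers above would typically be chained as
#
#   path_index, path_vectors = compute_path_embeddings(model, session, instances)
#   save_path_embeddings(model, path_vectors, path_index, embeddings_base_path)
#
# where `model`, `session`, `instances` and `embeddings_base_path` are assumed
# to be provided by the training driver.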


def load_path_embeddings(path_embeddings_dir, path_dim):
  """Loads pretrained path embeddings from a binary file and returns the matrix.

  Args:
    path_embeddings_dir: The directory for the path embeddings.
    path_dim: The dimension of the path embeddings, used as prefix to the
      path_vocab and path_vectors files.

  Returns:
    The path embeddings matrix and the path_to_index dictionary.
  """
  prefix = path_embeddings_dir + '/%d' % path_dim + '_'

  with open(prefix + 'path_vocab') as f_in:
    vocab = f_in.read().splitlines()

  vocab_size = len(vocab)
  embedding_file = prefix + 'path_vectors'
  print('Embedding file "%s" has %d paths' % (embedding_file, vocab_size))

  with open(embedding_file) as f_in:
    embeddings = np.load(f_in)

  path_to_index = {p: i for i, p in enumerate(vocab)}
  return embeddings, path_to_index


def get_indicative_paths(model, session, path_index, path_vectors, classes,
                         save_dir, k=20, threshold=0.8):
  """Gets the most indicative paths for each class.

  Args:
    model: The trained path-based model.
    session: The current TensorFlow session.
    path_index: A map from path to ID.
    path_vectors: The path embeddings.
    classes: The class label names.
    save_dir: Where to save the paths.
    k: The k for top-k paths.
    threshold: The threshold above which to consider paths as indicative.
  """
  # Define graph variables for this operation
  p_path_embedding = tf.placeholder(dtype=tf.float32,
                                    shape=[1, model.lstm_output_dim])
  p_distributions = tf.nn.softmax(tf.matmul(p_path_embedding, model.weights1))

  # Treat each path as a pair instance with a single path, and get the
  # relation distribution for it. Then, take the top paths for each relation.

  # This dictionary contains a relation as a key, and the value is a list of
  # tuples of path index and score. A relation r will contain (p, s) if the
  # path p is classified to r with a confidence of s.
  prediction_per_relation = collections.defaultdict(list)

  index_to_path = {i: p for p, i in path_index.iteritems()}

  # Predict all the paths
  for index in range(len(path_index)):
    curr_path_vector = path_vectors[index]

    distribution = session.run(
        p_distributions,
        feed_dict={p_path_embedding: np.reshape(curr_path_vector,
                                                [1, model.lstm_output_dim])})

    distribution = distribution[0, :]
    prediction = np.argmax(distribution)
    prediction_per_relation[prediction].append(
        (index, distribution[prediction]))

    if index % 10000 == 0:
      print('Classified %d/%d (%3.2f%%) of the paths' % (
          index, len(path_index), 100 * index / len(path_index)))

  # Retrieve k-best scoring paths for each relation
  for relation_index, relation in enumerate(classes):
    curr_paths = sorted(prediction_per_relation[relation_index],
                        key=lambda item: item[1], reverse=True)
    above_t = [(p, s) for (p, s) in curr_paths if s >= threshold]
    top_k = curr_paths[:k + 1]
    relation_paths = above_t if len(above_t) > len(top_k) else top_k

    paths_filename = os.path.join(save_dir, '%s.paths' % relation)
    with open(paths_filename, 'w') as f_out:
      for index, score in relation_paths:
        print('\t'.join([index_to_path[index], str(score)]), file=f_out)
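
# Usage sketch (illustrative, not part of the original file): to inspect what
# the path-based model has learned, the saved embeddings can be reloaded and
# the indicative paths written out per relation, e.g.
#
#   embeddings, path_to_index = load_path_embeddings(path_embeddings_dir,
#                                                    model.lstm_output_dim)
#   path_vectors = {i: embeddings[i] for i in range(len(path_to_index))}
#   get_indicative_paths(model, session, path_to_index, path_vectors, classes,
#                        save_dir)
#
# where `path_embeddings_dir`, `classes` and `save_dir` are assumed names
# supplied by the caller.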