![No Maintenance Intended](https://img.shields.io/badge/No%20Maintenance%20Intended-%E2%9C%95-red.svg)
![TensorFlow Requirement: 1.x](https://img.shields.io/badge/TensorFlow%20Requirement-1.x-brightgreen)
![TensorFlow 2 Not Supported](https://img.shields.io/badge/TensorFlow%202%20Not%20Supported-%E2%9C%95-red.svg)

# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for relation classification, applied here to classifying the
relationships that hold between the constituents of noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
   constituents in a large corpus. For example, given a sentence like *This fine
   oil is made from first-press olives*, the dependency path is something like
   `oil <NSUBJPASS made PREP> from POBJ> olive` (see the sketch after this
   list).
2. The distributional information provided by the individual words; i.e., the
   word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
   embedding of the noun compound in context.
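
To make the path notion in (1) concrete, here is a minimal sketch. It is not
part of this repository and uses spaCy rather than the parser used for the
paper; it simply walks from each constituent up to their lowest common
ancestor, whereas the encoding produced by `extract_paths.py` also includes
direction markers and other details.

    # Minimal sketch: the dependency path between two tokens, using spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the English model is installed

    def path_to_root(token):
        """Tokens from `token` up to (and including) the sentence root."""
        path = [token]
        while path[-1].head.i != path[-1].i:
            path.append(path[-1].head)
        return path

    def dependency_path(x, y):
        """Path x -> ... -> lowest common ancestor -> ... -> y."""
        up, down = path_to_root(x), path_to_root(y)
        down_ids = [t.i for t in down]
        lca = next(t for t in up if t.i in down_ids)  # lowest common ancestor
        up_part = up[:[t.i for t in up].index(lca.i) + 1]
        down_part = list(reversed(down[:down_ids.index(lca.i)]))
        return ["{}/{}/{}".format(t.text, t.pos_, t.dep_)
                for t in up_part + down_part]

    doc = nlp("This fine oil is made from first-press olives.")
    oil = next(t for t in doc if t.text == "oil")
    olives = next(t for t in doc if t.text == "olives")
    print("::".join(dependency_path(oil, olives)))
    # roughly: oil/NOUN/nsubjpass::made/VERB/ROOT::from/ADP/prep::olives/NOUN/pobj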

The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2).  The *distributional-nc* and *integrated-nc* models each add (3).

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*.  The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g. *part of* versus *composed of* versus
   *purpose*), and may consist of tens of classes.  You can download
   the dataset used in the paper from
   [here](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz).
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use the word embeddings directly as prediction features.
3. The path-based model requires a collection of syntactic dependency parses
   that connect the constituents of each noun compound. To generate these,
   you'll need a corpus from which to extract the paths; we used Wikipedia and
   the [LDC GigaWord5](https://catalog.ldc.upenn.edu/LDC2011T07) corpora.

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
  model to predict a noun-compound relationship given labeled noun-compounds and
  dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
  on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

Also included are utilities for preparing data for training:

* `text_embeddings_to_binary.py` converts a text file containing word embeddings
  into a binary file that is quicker to load.
* `extract_paths.py` finds all the dependency paths that connect words in a
  corpus.
* `sorted_paths_to_examples.py` processes the output of `extract_paths.py` to
  produce summarized training data.

This code (in particular, the utilities used to prepare the data) differs from
the code that was used to prepare data for the paper. Notably, the paper's data
was prepared with a proprietary dependency parser, whereas the code here uses
spaCy.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/) 1.x (TensorFlow 2 is not supported):
  see detailed installation instructions at that site.
* [scikit-learn](http://scikit-learn.org/): you can install this with
  `pip install scikit-learn`.
* [spaCy](https://spacy.io/): `pip install spacy` ought to do the trick, along
  with the English model (e.g. `python -m spacy download en_core_web_sm`).

# Creating the Model

This section describes the steps necessary to create and evaluate the model
described in the paper.

## Generate Path Data

To begin, you need three text files:

1. **Corpus**. This file should contain natural language sentences, written with
   one sentence per line.  For purposes of exposition, we'll assume that you
   have English Wikipedia serialized this way in `${HOME}/data/wiki.txt`.
2. **Labeled Noun Compound Pairs**.  This file contains (modifier, head, label)
   tuples, tab-separated, with one per line.  The *label* represents the
   relationship between the head and the modifier; e.g., if `purpose` is one of
   your labels, you might include `tooth<tab>paste<tab>purpose`.
3. **Word Embeddings**. We used the
   [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings; in
   particular the 6B token, 300d variant.  We'll assume you have this file as
   `${HOME}/data/glove.6B.300d.txt`.

We first convert the embeddings from their text format into a binary format
that can be loaded more quickly:

    ./text_embeddings_to_binary.py \
      --input ${HOME}/data/glove.6B.300d.txt \
      --output_vocab ${HOME}/data/vocab.txt \
      --output_npy ${HOME}/data/glove.6B.300d.npy
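
As a quick sanity check, the binary output can be loaded back with NumPy. This
sketch assumes (as the flag names suggest) that line *i* of `vocab.txt`
corresponds to row *i* of the `.npy` matrix:

    # Sanity check: load the converted embeddings.
    import os
    import numpy as np

    data_dir = os.path.expanduser(os.path.join("~", "data"))
    with open(os.path.join(data_dir, "vocab.txt")) as f:
        vocab = [line.rstrip("\n") for line in f]

    embeddings = np.load(os.path.join(data_dir, "glove.6B.300d.npy"))
    assert len(vocab) == embeddings.shape[0]

    row = {word: i for i, word in enumerate(vocab)}
    print(embeddings[row["oil"]].shape)  # (300,)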

Next, we'll extract all the dependency parse paths connecting our labeled pairs
from the corpus.  This process takes a *looooong* time, but is trivially
parallelized using map-reduce if you have access to that technology.

    ./extract_paths.py \
      --corpus ${HOME}/data/wiki.txt \
      --labeled_pairs ${HOME}/data/labeled-pairs.tsv \
      --output ${HOME}/data/paths.tsv

The file it produces (`paths.tsv`) is a tab-separated file that contains the
modifier, the head, the label, the encoded path, and the sentence from which the
path was drawn.  (This last is mostly for sanity checking.)  A sample row might
look something like this (where newlines would actually be tab characters):

    navy
    captain
    owner_emp_use
    <X>/PROPN/dobj/>::enter/VERB/ROOT/^::follow/VERB/advcl/<::in/ADP/prep/<::footstep/NOUN/pobj/<::of/ADP/prep/<::father/NOUN/pobj/<::bover/PROPN/appos/<::<Y>/PROPN/compound/<
    He entered the Royal Navy following in the footsteps of his father Captain John Bover and two of his elder brothers as volunteer aboard HMS Perseus
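
As the sample row shows, each step of the encoded path packs a word, its part
of speech, its dependency label, and a direction marker, separated by `/` and
joined with `::`, so a path can be pulled apart with ordinary string splitting.
A small sketch:

    # Sketch: split one encoded path from paths.tsv into its steps.
    # Each step looks like word/POS/dependency-label/direction, joined by "::".
    path = ("<X>/PROPN/dobj/>::enter/VERB/ROOT/^::follow/VERB/advcl/<"
            "::in/ADP/prep/<::footstep/NOUN/pobj/<::of/ADP/prep/<"
            "::father/NOUN/pobj/<::bover/PROPN/appos/<::<Y>/PROPN/compound/<")
    for step in path.split("::"):
        word, pos, dep, direction = step.rsplit("/", 3)
        print(word, pos, dep, direction)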

This file must be sorted as follows:

    sort -k1,3 -t$'\t' paths.tsv > sorted.paths.tsv

In particular, rows with the same modifier, head, and label must appear
contiguously.

We next create a file that contains all the relation labels from our original
labeled pairs:

    awk 'BEGIN {FS="\t"} {print $3}' < ${HOME}/data/labeled-pairs.tsv \
      | sort -u > ${HOME}/data/relations.txt

With these in hand, we're ready to produce the train, validation, and test data:

    ./sorted_paths_to_examples.py \
       --input ${HOME}/data/sorted.paths.tsv \
       --vocab ${HOME}/data/vocab.txt \
       --relations ${HOME}/data/relations.txt \
       --splits ${HOME}/data/splits.txt \
       --output_dir ${HOME}/data

Here, `splits.txt` is a file that indicates which "split" (train, test, or
validation) you want each pair to appear in.  It should be a tab-separated file
which contains the modifier, the head, and the dataset (`train`, `test`, or
`val`) into which the pair should be placed; e.g.:

    tooth <TAB> paste <TAB> train
    banana <TAB> seat <TAB> test
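
The paper evaluates on several split schemes (random and lexical splits; see
the training loop below), but if you just need a quick random split, a minimal
sketch along these lines works; the 80/10/10 proportions are only an example,
and the file names are the ones assumed throughout this README:

    # Sketch: build a simple random splits.txt from labeled-pairs.tsv.
    import os
    import random

    random.seed(0)
    data_dir = os.path.expanduser(os.path.join("~", "data"))
    with open(os.path.join(data_dir, "labeled-pairs.tsv")) as pairs, \
         open(os.path.join(data_dir, "splits.txt"), "w") as out:
        for line in pairs:
            modifier, head, _label = line.rstrip("\n").split("\t")
            r = random.random()
            split = "train" if r < 0.8 else ("val" if r < 0.9 else "test")
            out.write("{}\t{}\t{}\n".format(modifier, head, split))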

The program will produce a separate file for each dataset split in the directory
specified by `--output_dir`.  Each file contains `tf.train.Example` protocol
buffers encoded in the `TFRecord` file format.
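
If you want to peek at what was written, the records can be read back with the
TensorFlow 1.x `tf.python_io` API. This sketch assumes the files are
gzip-compressed (as the `.gz` suffix suggests); the exact feature names depend
on `sorted_paths_to_examples.py`:

    # Peek at one record from the generated training split (TensorFlow 1.x API).
    import os
    import tensorflow as tf

    path = os.path.expanduser(os.path.join("~", "data", "train.tfrecs.gz"))
    options = tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.GZIP)
    for record in tf.python_io.tf_record_iterator(path, options=options):
        example = tf.train.Example.FromString(record)
        print(sorted(example.features.feature.keys()))  # feature names
        break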

## Create Path Embeddings

Now we're ready to train the path embeddings using `learn_path_embeddings.py`:

    ./learn_path_embeddings.py \
        --train ${HOME}/data/train.tfrecs.gz \
        --val ${HOME}/data/val.tfrecs.gz \
        --test ${HOME}/data/test.tfrecs.gz \
        --embeddings ${HOME}/data/glove.6B.300d.npy \
        --relations ${HOME}/data/relations.txt \
        --output ${HOME}/data/path-embeddings \
        --logdir /tmp/learn_path_embeddings

The path embeddings will be placed at the location specified by `--output`.

## Train Classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script.  This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.

    LOGDIR=/tmp/learn_classifier
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.

If you use this code for any published research, please include the following citation:

Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model. 
Vered Shwartz and Chris Waterson. NAACL 2018. [link](https://arxiv.org/pdf/1803.08073.pdf).