![No Maintenance Intended](https://img.shields.io/badge/No%20Maintenance%20Intended-%E2%9C%95-red.svg)
![TensorFlow Requirement: 1.x](https://img.shields.io/badge/TensorFlow%20Requirement-1.x-brightgreen)
![TensorFlow 2 Not Supported](https://img.shields.io/badge/TensorFlow%202%20Not%20Supported-%E2%9C%95-red.svg)

# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for relation classification, applied here to classifying the
relationships that hold between the constituents of noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
   constituents in a large corpus. For example, given a sentence like *This fine
   oil is made from first-press olives*, the dependency path is something like
   `oil <NSUBJPASS made PREP> from POBJ> olive`. (A minimal extraction sketch
   follows this list.)
2. The distributional information provided by the individual words; i.e., the
   word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
   embedding of the noun compound in context.
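
For illustration, here is a minimal sketch of how a dependency path like the one
in (1) can be read off a parse. It uses [spaCy](https://spacy.io/), which is not
a dependency of this repository, and the exact labels depend on the parser; the
repository's own path-extraction tooling is not included here.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def dependency_path(source, target):
        """Shortest path between two tokens through the dependency tree,
        returned as (lemma, dependency-label) pairs."""
        def chain_to_root(tok):
            chain = [tok]
            while tok.head.i != tok.i:
                tok = tok.head
                chain.append(tok)
            return chain

        up = chain_to_root(source)    # source -> ... -> root
        down = chain_to_root(target)  # target -> ... -> root
        down_pos = {t.i: pos for pos, t in enumerate(down)}
        # Lowest common ancestor: the first token on the way up from the
        # source that also lies on the target's chain to the root.
        lca_up, lca_down = next(
            (pos, down_pos[t.i]) for pos, t in enumerate(up) if t.i in down_pos)
        path = up[:lca_up + 1] + list(reversed(down[:lca_down]))
        return [(t.lemma_, t.dep_) for t in path]

    doc = nlp("This fine oil is made from first-press olives.")
    oil = next(t for t in doc if t.text == "oil")
    olives = next(t for t in doc if t.text == "olives")
    print(dependency_path(oil, olives))
    # e.g. [('oil', 'nsubjpass'), ('make', 'ROOT'), ('from', 'prep'), ('olive', 'pobj')]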

The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2).  The *distributional-nc* and *integrated-nc* models each add (3).

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*.  The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g. *part of* versus *composed of* versus
   *purpose*), and generally may consist of tens of classes. 
   You can download the dataset used in the paper from [here](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz).
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use the word embeddings directly as prediction features.
3. For the path-based model, a collection of syntactic dependency parses that
   connect the constituents of each noun compound.

At the moment, this repository does not contain the tools for generating this
data, but we will provide references to existing datasets and plan to add
data-generation tools in the future.
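
For convenience, the labeled noun compounds mentioned in item 1 can be fetched
and unpacked with a few lines of Python. This is only a sketch, not a repository
script, and it assumes the datasets live under `~/lexnet/datasets`, the
directory passed to the scripts below via `--dataset_dir`:

    import os
    import tarfile
    import urllib.request

    # Download the Tratz (2011) dataset linked above and unpack it into the
    # directory that the training scripts read from.
    URL = "https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz"
    dataset_dir = os.path.expanduser("~/lexnet/datasets")
    os.makedirs(dataset_dir, exist_ok=True)

    archive = os.path.join(dataset_dir, "Tratz2011_Dataset.tar.gz")
    urllib.request.urlretrieve(URL, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dataset_dir)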

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
  model to predict a noun-compound relationship given labeled noun compounds
  and dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
  on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site. As the badges above indicate, this code requires
  TensorFlow 1.x and does not support TensorFlow 2.
* [scikit-learn](http://scikit-learn.org/): you can probably just install this
  with `pip install scikit-learn`.
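
A quick way to check that an environment satisfies both requirements (a
sanity-check sketch, not a repository script):

    # Verify that the dependencies are importable; the TensorFlow version
    # printed should be 1.x, since TensorFlow 2 is not supported.
    import sklearn
    import tensorflow as tf

    print("TensorFlow:", tf.__version__)
    print("scikit-learn:", sklearn.__version__)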

# Creating the Model

This section describes the steps required to reproduce the results in the
paper.

## Generate/Download Path Data

TBD! We plan to make available the aggregate path data that was used to train
the path embeddings and classifiers; however, this will be released separately.

## Generate/Download Embedding Data

TBD! While we used the standard GloVe vectors for the relata embeddings, the NC
embeddings were generated separately. We plan to make that data available, but
it will be released separately.
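
Standard GloVe vectors are distributed as plain text with one word and its
vector per line. As a point of reference only (the exact input format expected
by the scripts is not documented here), a minimal sketch for reading such a
file into memory, with `glove.txt` as a hypothetical path:

    import numpy as np

    def load_glove(path):
        """Read GloVe-format text vectors into a {word: vector} dict."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                vectors[word] = np.asarray(values, dtype=np.float32)
        return vectors

    embeddings = load_glove("glove.txt")  # hypothetical path to GloVe vectors
    print(len(embeddings), "words loaded")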

## Create Path Embeddings

Create the path embeddings using `learn_path_embeddings.py`.  This shell script
fragment will iterate through each dataset, split, and corpus to generate path
embeddings for each.

    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          python learn_path_embeddings.py \
            --dataset_dir ~/lexnet/datasets \
            --dataset "${DATASET}" \
            --corpus "${SPLIT}/${CORPUS}" \
            --embeddings_base_path ~/lexnet/embeddings \
            --logdir /tmp/learn_path_embeddings
        done
      done
    done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.

## Train Classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script.  This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers.

    LOGDIR=/tmp/learn_classifier
    mkdir -p "${LOGDIR}"  # Make sure the log directory exists before redirecting into it.
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.
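
The reported scores are standard multi-class classification metrics. For
reference, this is how such numbers can be computed with scikit-learn (a sketch
with toy labels, not the script's internal code):

    from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

    # Toy example: gold and predicted relation labels for a few noun compounds.
    y_true = ["MADE_FROM", "USED_FOR", "CONTAINED_IN", "USED_FOR"]
    y_pred = ["MADE_FROM", "USED_FOR", "USED_FOR", "USED_FOR"]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted")
    print("P=%.3f R=%.3f F1=%.3f" % (precision, recall, f1))
    print(confusion_matrix(y_true, y_pred))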

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.

If you use this code for any published research, please include the following citation:

Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model. 
Vered Shwartz and Chris Waterson. NAACL 2018. [link](https://arxiv.org/pdf/1803.08073.pdf).