# SyntaxNet: Neural Models of Syntax.

*A TensorFlow toolkit for deep learning powered natural language understanding
(NLU).*

**CoNLL**: See [here](g3doc/conll2017/README.md) for instructions for using the
SyntaxNet/DRAGNN baseline for the CoNLL2017 Shared Task.

At Google, we spend a lot of time thinking about how computer systems can read
and understand human language in order to process it in intelligent ways. We are
excited to share the fruits of our research with the broader community by
releasing SyntaxNet, an open-source neural network framework for
[TensorFlow](http://www.tensorflow.org) that provides a foundation for Natural
Language Understanding (NLU) systems. Our release includes all the code needed
to train new SyntaxNet models on your own data, as well as a suite of models
that we have trained for you, and that you can use to analyze text in over 40
languages.

This repository is largely divided into two sub-packages:

1.  **DRAGNN:
    [code](https://github.com/tensorflow/models/tree/master/research/syntaxnet/dragnn),
    [documentation](g3doc/DRAGNN.md),
    [paper](https://arxiv.org/pdf/1703.04474.pdf)** implements Dynamic Recurrent
    Acyclic Graphical Neural Networks (DRAGNN), a framework for building
    multi-task, fully dynamically constructed computation graphs. Practically,
    we use DRAGNN to extend our prior work from [Andor et al.
    (2016)](http://arxiv.org/abs/1603.06042) with end-to-end, deep recurrent
    models and to provide a much easier-to-use interface to SyntaxNet. *DRAGNN
    is designed first and foremost as a Python library, and is therefore much
    easier to use than the original SyntaxNet implementation.*

1.  **SyntaxNet:
    [code](https://github.com/tensorflow/models/tree/master/research/syntaxnet/syntaxnet),
    [documentation](g3doc/syntaxnet-tutorial.md)** is a transition-based
    framework for natural language processing, with core functionality for
    feature extraction, representing annotated data, and evaluation. As of the
    DRAGNN release, we recommend training and deploying SyntaxNet models with
    the DRAGNN framework.

## How to use this library

There are three ways to use SyntaxNet:

*   See [here](g3doc/conll2017/README.md) for instructions for using the
    SyntaxNet/DRAGNN baseline for the CoNLL2017 Shared Task, and running the
    ParseySaurus models.
*   You can use DRAGNN to train your NLP models for other tasks and datasets.
    See "Getting started with DRAGNN" below.
*   You can continue to use the Parsey McParseface family of pre-trained
    SyntaxNet models. See "Pre-trained NLP models" below.

## Installation

### Docker installation

_This process takes ~10 minutes._

The simplest way to get started with DRAGNN is by loading our Docker container.
[Here](g3doc/CLOUD.md) is a tutorial for running the DRAGNN container on
[GCP](https://cloud.google.com) (just as applicable to your own computer).
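
As a rough sketch of what the container workflow looks like (the image name and
port here are placeholders, not real values; the tutorial above has the actual
commands):

```shell
# A minimal sketch, assuming a published DRAGNN image; substitute the
# image name and port from the g3doc/CLOUD.md tutorial.
docker pull <dragnn-image>
docker run -it -p 8888:8888 <dragnn-image>
```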

### Ubuntu 16.04+ binary installation

_This process takes ~5 minutes, but is only compatible with Linux using GNU
libstdc++ 3.4.22 and above (e.g. Ubuntu 16.04)._

Binary wheel packages are provided for TensorFlow and SyntaxNet. If you do not
need to write new binary TensorFlow ops, these should suffice.

*   `apt-get install -y graphviz libgraphviz-dev libopenblas-base libpng16-16
    libxft2 python-pip python-mock`
*   `pip install pygraphviz
    --install-option="--include-path=/usr/include/graphviz"
    --install-option="--library-path=/usr/lib/graphviz/"`
*   `pip install 'ipython<6.0' protobuf numpy scipy jupyter
    syntaxnet-with-tensorflow`
*   `python -m jupyter_core.command nbextension enable --py --sys-prefix
    widgetsnbextension`

You can test that the binary modules can be imported successfully by running:

*   `python -c 'import dragnn.python.load_dragnn_cc_impl,
    syntaxnet.load_parser_ops'`

### Manual installation

_This process takes 1-2 hours._

Running and training SyntaxNet/DRAGNN models requires building this package from
source. You'll need to install:

*   python 2.7:
    *   Python 3 support is not available yet
*   bazel 0.11.1:
    *   Follow the instructions [here](http://bazel.build/docs/install.html)
    *   Alternatively, download the bazel 0.11.1 `.deb` package from
        [https://github.com/bazelbuild/bazel/releases](https://github.com/bazelbuild/bazel/releases)
        for your system configuration.
    *   Install it using the command: `sudo dpkg -i <.deb file>`
    *   Check the bazel version by typing: `bazel version`
*   swig:
    *   `apt-get install swig` on Ubuntu
    *   `brew install swig` on OSX
*   protocol buffers, with a version supported by TensorFlow:
    *   check your protobuf version with `pip freeze | grep protobuf`
    *   upgrade to a supported version with `pip install -U protobuf==3.3.0`
*   autograd, with a version supported by TensorFlow:
    *   `pip install -U autograd==1.1.13`
*   mock, the testing package:
    *   `pip install mock`
*   asciitree, to draw parse trees on the console for the demo:
    *   `pip install asciitree`
*   numpy, package for scientific computing:
    *   `pip install numpy`
*   pygraphviz to visualize traces and parse trees:
    *   `apt-get install -y graphviz libgraphviz-dev`
    *   `pip install pygraphviz
        --install-option="--include-path=/usr/include/graphviz"
        --install-option="--library-path=/usr/lib/graphviz/"`

Once you have completed the above steps, you can build and test SyntaxNet with
the following commands:

```shell
  git clone --recursive https://github.com/tensorflow/models.git
  cd models/research/syntaxnet/tensorflow
  ./configure
  cd ..
  bazel test ...
  # On Mac, run the following:
  bazel test --linkopt=-headerpad_max_install_names \
    dragnn/... syntaxnet/... util/utf8/...
```

Bazel should report that all tests passed.

Now you can install the SyntaxNet and DRAGNN Python modules with the following
commands:

```shell
  mkdir /tmp/syntaxnet_pkg
  bazel-bin/dragnn/tools/build_pip_package --output-dir=/tmp/syntaxnet_pkg
  #  The filename of the .whl depends on your platform.
  sudo pip install /tmp/syntaxnet_pkg/syntaxnet-x.xx-none-any.whl
```

To build SyntaxNet with GPU support, please refer to the instructions in
[issues/248](https://github.com/tensorflow/models/issues/248).

**Note:** If you are running Docker on OSX, make sure that you have enough
memory allocated for your Docker VM.

## Getting Started

We have a few guides in this README, as well as more extensive
[documentation](g3doc/).

### Learning the DRAGNN framework

![DRAGNN](g3doc/unrolled-dragnn.png)

An easy and visual way to get started with DRAGNN is to run our Jupyter
notebooks for [interactive
debugging](examples/dragnn/interactive_text_analyzer.ipynb) and [training a new
model](examples/dragnn/trainer_tutorial.ipynb). Our tutorial
[here](g3doc/CLOUD.md) explains how to start it up from the Docker container.
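
If you are running outside the container, here is a minimal sketch for opening
the notebooks locally (assuming the pip packages above are installed and you
are at the repository root):

```shell
# Launch Jupyter pointed at the bundled DRAGNN example notebooks.
python -m jupyter_core.command notebook --notebook-dir=examples/dragnn
```
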
Once you have DRAGNN installed and running, try out the
[ParseySaurus](g3doc/conll2017) models.

### Using the Pre-trained NLP models

We are happy to release *Parsey McParseface*, an English parser that we have
trained for you, and that you can use to analyze English text, along with
[trained models for 40 languages](g3doc/universal.md) and support for text
segmentation and morphological analysis.

Once you have successfully built SyntaxNet, you can start parsing text right
away with Parsey McParseface, located under `syntaxnet/models`. The easiest
thing is to use or modify the included script `syntaxnet/demo.sh`, which shows
a basic setup for parsing plain-text English input.

You can also skip right away to the [detailed SyntaxNet
tutorial](g3doc/syntaxnet-tutorial.md).

How accurate is Parsey McParseface? For the initial release, we aimed for a
model that runs fast enough to be useful on a single machine (e.g. ~600
words/second on a modern desktop) while also being the most accurate parser
available. Here's how Parsey McParseface compares to the academic literature on
several different English domains (all numbers are % correct head assignments
in the tree, i.e. unlabelled attachment score):

Model                                                                                                           | News  | Web   | Questions
--------------------------------------------------------------------------------------------------------------- | :---: | :---: | :-------:
[Martins et al. (2013)](http://www.cs.cmu.edu/~ark/TurboParser/)                                                | 93.10 | 88.23 | 94.21
[Zhang and McDonald (2014)](http://research.google.com/pubs/archive/38148.pdf)                                  | 93.32 | 88.65 | 93.37
[Weiss et al. (2015)](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf) | 93.91 | 89.29 | 94.17
[Andor et al. (2016)](http://arxiv.org/abs/1603.06042)*                                                         | 94.44 | 90.17 | 95.40
Parsey McParseface                                                                                              | 94.15 | 89.08 | 94.77

We see that Parsey McParseface is state-of-the-art; more importantly, with
SyntaxNet you can train larger networks with more hidden units and bigger beam
sizes if you want to push the accuracy even further: [Andor et al.
(2016)](http://arxiv.org/abs/1603.06042)* is simply a SyntaxNet model with a
larger beam and network. For further information on the datasets, see that paper
under the section "Treebank Union".

Parsey McParseface is also state-of-the-art for part-of-speech (POS) tagging
(numbers below are per-token accuracy):

Model                                                                      | News  | Web   | Questions
-------------------------------------------------------------------------- | :---: | :---: | :-------:
[Ling et al. (2015)](http://www.cs.cmu.edu/~lingwang/papers/emnlp2015.pdf) | 97.44 | 94.03 | 96.18
[Andor et al. (2016)](http://arxiv.org/abs/1603.06042)*                    | 97.77 | 94.80 | 96.86
Parsey McParseface                                                         | 97.52 | 94.24 | 96.45

#### Parsing from Standard Input

Simply pass one sentence per line of text into the script at
`syntaxnet/demo.sh`. The script will break the text into words, run the POS
tagger, run the parser, and then generate an ASCII version of the parse tree:

```shell
echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh

Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct
```

The ASCII tree shows the text organized as in the parse, not left-to-right as
visualized in our tutorial graphs. In this example, we see that the verb
"brought" is the root of the sentence, with the subject "Bob", the object
"pizza", and the prepositional phrase "to Alice".

If you want to feed in tokenized, CoNLL-formatted text, you can run `demo.sh
--conll`.
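
For example, here is a quick sketch of piping an existing CoNLL file through
the demo (the file name is illustrative):

```shell
# Feed pre-tokenized CoNLL input through the tagger and parser.
cat wsj.conll | syntaxnet/demo.sh --conll
```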

#### Annotating a Corpus

To change the pipeline to read and write to specific files (as opposed to
piping through stdin and stdout), we have to modify `demo.sh` to point to the
files we want. The SyntaxNet models are configured via a combination of
run-time flags (which are easy to change) and a text-format `TaskSpec`
protocol buffer. The
spec file used in the demo is in
`syntaxnet/models/parsey_mcparseface/context.pbtxt`.

To use corpora instead of stdin/stdout, we have to:

1.  Create or modify an `input` field inside the `TaskSpec`, with the
    `file_pattern` specifying the location we want. If the input corpus is in
    CoNLL format, make sure to put `record_format: 'conll-sentence'`.
1.  Change the `--input` and/or `--output` flag to use the name of the
    resource, instead of `stdin` and `stdout`.

E.g., if we wanted to POS tag the CoNLL corpus `./wsj.conll`, we would create
two entries, one for the input and one for the output:

```proto
input {
  name: 'wsj-data'
  record_format: 'conll-sentence'
  Part {
    file_pattern: './wsj.conll'
  }
}
input {
  name: 'wsj-data-tagged'
  record_format: 'conll-sentence'
  Part {
    file_pattern: './wsj-tagged.conll'
  }
}
```

Then we can use `--input=wsj-data --output=wsj-data-tagged` on the command line
to specify reading and writing to these files.
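
Putting it together, here is a hedged sketch of such an invocation (the binary
and flags follow the conventions used in `demo.sh`; additional flags such as
`--model_path` may be required in practice):

```shell
# Illustrative only: read from the 'wsj-data' resource defined above and
# write annotations to the 'wsj-data-tagged' resource.
bazel-bin/syntaxnet/parser_eval \
  --task_context=syntaxnet/models/parsey_mcparseface/context.pbtxt \
  --input=wsj-data \
  --output=wsj-data-tagged
```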

#### Configuring the Python Scripts

As mentioned above, the Python scripts are configured in two ways:

1.  **Run-time flags** are used to point to the `TaskSpec` file, switch between
    inputs for reading and writing, and set various run-time model parameters.
    At training time, these flags are used to set the learning rate, hidden
    layer sizes, and other key parameters.
1.  The **`TaskSpec` proto** stores configuration about the transition system,
    the features, and a set of named static resources required by the parser. It
    is specified via the `--task_context` flag. A few key notes to remember:

    -   The `Parameter` settings in the `TaskSpec` have a prefix: either
        `brain_pos` (they apply to the tagger) or `brain_parser` (they apply to
        the parser). The `--prefix` run-time flag switches between reading from
        the two configurations.
    -   The resources will be created and/or modified during multiple stages of
        training. As described above, the resources can also be used at
        evaluation time to read or write to specific files. These resources are
        also separate from the model parameters, which are saved separately via
        calls to TensorFlow ops, and loaded via the `--model_path` flag.
    -   Because the `TaskSpec` contains file paths, remember that copying around
        this file is not enough to relocate a trained model: you need to move
        and update all the paths as well.

Note that some run-time flags need to be consistent between training and testing
(e.g. the number of hidden units).
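
For instance, here is a minimal sketch of the relocation caveat above, assuming
GNU `sed` and purely illustrative paths:

```shell
# Copy the model directory, then rewrite the absolute paths embedded in
# the TaskSpec so the resources resolve at the new location.
cp -r /path/to/old_model /path/to/new_model
sed -i 's|/path/to/old_model|/path/to/new_model|g' \
  /path/to/new_model/context.pbtxt
```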

### Next Steps

There are many ways to extend this framework, e.g. adding new features, changing
the model structure, training on other languages, etc. We suggest reading the
detailed tutorial below to get a handle on the rest of the framework.

## Contact

To ask questions or report issues, please post on Stack Overflow with the tag
[syntaxnet](http://stackoverflow.com/questions/tagged/syntaxnet) or open an
issue on the tensorflow/models [issues
tracker](https://github.com/tensorflow/models/issues). Please assign SyntaxNet
issues to @calberti or @andorardo.

## Credits

Original authors of the code in this package include (in alphabetical order):

*   Alessandro Presta
*   Aliaksei Severyn
*   Andy Golding
*   Bernd Bohnet
*   Chayut Thanapirom
*   Chris Alberti
*   Daniel Andor
*   David Weiss
*   Emily Pitler
*   Greg Coppola
*   Ivan Bogatyy
*   Ji Ma
*   Keith Hall
*   Kuzman Ganchev
*   Lingpeng Kong
*   Livio Baldini Soares
*   Mark Omernick
*   Michael Collins
*   Michael Ringgaard
*   Ryan McDonald
*   Slav Petrov
*   Stefan Istrate
*   Terry Koo
*   Tim Credo
*   Zora Tung