Commit 32ab5a58 authored by calberti, committed by Martin Wicke

Adding SyntaxNet to tensorflow/models (#63)

parent 148a15fb
autoencoder/MNIST_data/*
*.pyc
[submodule "tensorflow"]
path = syntaxnet/tensorflow
url = https://github.com/tensorflow/tensorflow.git
/bazel-bin
/bazel-genfiles
/bazel-out
/bazel-tensorflow
/bazel-testlogs
/bazel-tf
/bazel-syntaxnet
# SyntaxNet: Neural Models of Syntax.
*A TensorFlow implementation of the models described in
[Andor et al. (2016)](http://arxiv.org/pdf/1603.06042v1.pdf).*
At Google, we spend a lot of time thinking about how computer systems can read
and understand human language in order to process it in intelligent ways. We are
excited to share the fruits of our research with the broader community by
releasing SyntaxNet, an open-source neural network framework for
[TensorFlow](http://www.tensorflow.org) that provides a foundation for Natural Language
Understanding (NLU) systems. Our release includes all the code needed to train
new SyntaxNet models on your own data, as well as *Parsey McParseface*, an
English parser that we have trained for you, and that you can use to analyze
English text.
So, how accurate is Parsey McParseface? For this release, we tried to balance a
model that runs fast enough to be useful on a single machine (e.g. ~600
words/second on a modern desktop) with one that is also the most accurate parser
available. Here's how Parsey McParseface compares to the academic literature on
several different English domains (all numbers are % correct head assignments in
the tree, i.e. unlabeled attachment score):
Model | News | Web | Questions
--------------------------------------------------------------------------------------------------------------- | :---: | :---: | :-------:
[Martins et al. (2013)](http://www.cs.cmu.edu/~ark/TurboParser/) | 93.10 | 88.23 | 94.21
[Zhang and McDonald (2014)](http://research.google.com/pubs/archive/38148.pdf) | 93.32 | 88.65 | 93.37
[Weiss et al. (2015)](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf) | 93.91 | 89.29 | 94.17
[Andor et al. (2016)](http://arxiv.org/pdf/1603.06042v1.pdf)* | 94.44 | 90.17 | 95.40
Parsey McParseface | 94.15 | 89.08 | 94.77
We see that Parsey McParseface is state-of-the-art; more importantly, with
SyntaxNet you can train larger networks with more hidden units and bigger beam
sizes if you want to push the accuracy even further:
[Andor et al. (2016)](http://arxiv.org/pdf/1603.06042v1.pdf)* is simply a
SyntaxNet model with a larger beam and network. For further information on the
datasets, see that paper under the section "Treebank Union".
Parsey McParseface is also state-of-the-art for part-of-speech (POS) tagging
(numbers below are per-token accuracy):
Model | News | Web | Questions
-------------------------------------------------------------------------- | :---: | :---: | :-------:
[Ling et al. (2015)](http://www.cs.cmu.edu/~lingwang/papers/emnlp2015.pdf) | 97.78 | 94.03 | 96.18
[Andor et al. (2016)](http://arxiv.org/pdf/1603.06042v1.pdf)* | 97.77 | 94.80 | 96.86
Parsey McParseface | 97.52 | 94.24 | 96.45
The first part of this tutorial describes how to install the necessary tools and
use the already trained models provided in this release. In the second part of
the tutorial we provide more background about the models, as well as
instructions for training models on other datasets.
## Contents
* [Installation](#installation)
* [Getting Started](#getting-started)
  * [Parsing from Standard Input](#parsing-from-standard-input)
  * [Annotating a Corpus](#annotating-a-corpus)
  * [Configuring the Python Scripts](#configuring-the-python-scripts)
  * [Next Steps](#next-steps)
* [Detailed Tutorial: Building an NLP Pipeline with SyntaxNet](#detailed-tutorial-building-an-nlp-pipeline-with-syntaxnet)
  * [Obtaining Data](#obtaining-data)
  * [Part-of-Speech Tagging](#part-of-speech-tagging)
    * [Training the SyntaxNet POS Tagger](#training-the-syntaxnet-pos-tagger)
    * [Preprocessing with the Tagger](#preprocessing-with-the-tagger)
  * [Dependency Parsing: Transition-Based Parsing](#dependency-parsing-transition-based-parsing)
    * [Training a Parser Step 1: Local Pretraining](#training-a-parser-step-1-local-pretraining)
    * [Training a Parser Step 2: Global Training](#training-a-parser-step-2-global-training)
* [Contact](#contact)
* [Credits](#credits)
## Installation
Running and training SyntaxNet models requires building this package from
source. You'll need to install:
* bazel:
  * follow the instructions [here](http://bazel.io/docs/install.html)
  * **Note: You must use bazel version 0.2.2, NOT 0.2.2b, due to a WORKSPACE
    issue**
* swig:
  * `apt-get install swig` on Ubuntu
  * `brew install swig` on OSX
* protocol buffers, with a version supported by TensorFlow:
  * check your protobuf version with `pip freeze | grep protobuf`
  * upgrade to a supported version with `pip install -U protobuf==3.0.0b2`
* asciitree, to draw parse trees on the console for the demo:
  * `pip install asciitree`
Once you have completed the above steps, you can build and test SyntaxNet with the
following commands:
```shell
git clone --recursive https://github.com/tensorflow/models.git
cd models/syntaxnet/tensorflow
./configure
cd ..
bazel test syntaxnet/... util/utf8/...
# On Mac, run the following:
bazel test --linkopt=-headerpad_max_install_names \
  syntaxnet/... util/utf8/...
```
Bazel should finish by reporting that all tests have passed.
## Getting Started
Once you have successfully built SyntaxNet, you can start parsing text right
away with Parsey McParseface, located under `syntaxnet/models`. The easiest
thing is to use or modify the included script `syntaxnet/demo.sh`, which shows a
basic setup to parse English taking plain text as input.
### Parsing from Standard Input
Simply pass one sentence per line of text into the script at
`syntaxnet/demo.sh`. The script will break the text into words, run the POS
tagger, run the parser, and then generate an ASCII version of the parse tree:
```shell
echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh
Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct
```
The ASCII tree shows the text organized as in the parse, not left-to-right as
visualized in our tutorial graphs. In this example, we see that the verb
"brought" is the root of the sentence, with the subject "Bob", the object
"pizza", and the prepositional phrase "to Alice".
If you want to feed in tokenized, CONLL-formatted text, you can run `demo.sh
--conll`.
### Annotating a Corpus
To change the pipeline to read and write to specific files (as opposed to piping
through stdin and stdout), we have to modify the `demo.sh` to point to the files
we want. The SyntaxNet models are configured via a combination of run-time flags
(which are easy to change) and a text format `TaskSpec` protocol buffer. The
spec file used in the demo is in `syntaxnet/models/treebank_union/context`.
To use corpora instead of stdin/stdout, we have to:
1. Create or modify an `input` field inside the `TaskSpec`, with the
   `file_pattern` specifying the location we want. If the input corpus is in
   CONLL format, make sure to put `record_format: 'conll-sentence'`.
1. Change the `--input` and/or `--output` flag to use the name of the resource
   as the output, instead of `stdin` and `stdout`.
E.g., if we wanted to POS tag the CONLL corpus `./wsj.conll`, we would create
two entries, one for the input and one for the output:
```proto
input {
  name: 'wsj-data'
  record_format: 'conll-sentence'
  Part {
    file_pattern: './wsj.conll'
  }
}
input {
  name: 'wsj-data-tagged'
  record_format: 'conll-sentence'
  Part {
    file_pattern: './wsj-tagged.conll'
  }
}
```
Then we can use `--input=wsj-data --output=wsj-data-tagged` on the command line
to specify reading and writing to these files.
### Configuring the Python Scripts
As mentioned above, the Python scripts are configured in two ways:
1. **Run-time flags** are used to point to the `TaskSpec` file, switch between
   inputs for reading and writing, and set various run-time model parameters.
   At training time, these flags are used to set the learning rate, hidden
   layer sizes, and other key parameters.
1. The **`TaskSpec` proto** stores configuration about the transition system,
   the features, and a set of named static resources required by the parser. It
   is specified via the `--task_context` flag. A few key notes to remember:
   - The `Parameter` settings in the `TaskSpec` have a prefix: either
     `brain_pos` (they apply to the tagger) or `brain_parser` (they apply to
     the parser). The `--arg_prefix` run-time flag switches between reading
     from the two configurations.
   - The resources will be created and/or modified during multiple stages of
     training. As described above, the resources can also be used at
     evaluation time to read or write to specific files. These resources are
     also separate from the model parameters, which are saved separately via
     calls to TensorFlow ops, and loaded via the `--model_path` flag.
   - Because the `TaskSpec` contains file paths, remember that copying this
     file around is not enough to relocate a trained model: you need to move
     the files and update all the paths as well.
Note that some run-time flags need to be consistent between training and testing
(e.g. the number of hidden units).
### Next Steps
There are many ways to extend this framework, e.g. adding new features, changing
the model structure, training on other languages, etc. We suggest reading the
detailed tutorial below to get a handle on the rest of the framework.
## Detailed Tutorial: Building an NLP Pipeline with SyntaxNet
In this tutorial, we'll go over how to train new models, and explain in a bit
more technical detail the NLP side of the models. Our goal here is to explain
the NLP pipeline produced by this package.
### Obtaining Data
The included English parser, Parsey McParseface, was trained on the standard
corpora of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) and
[OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19), as well as the [English
Web Treebank](https://catalog.ldc.upenn.edu/LDC2012T13), but these are
unfortunately not freely available.
However, the [Universal Dependencies](http://universaldependencies.org/) project
provides freely available treebank data in a number of languages. SyntaxNet can
be trained and evaluated on any of these corpora.
### Part-of-Speech Tagging
Consider the following sentence, which exhibits several ambiguities that affect
its interpretation:
> I saw the man with glasses.
This sentence is composed of words: strings of characters that are segmented
into groups (e.g. "I", "saw", etc.). Each word in the sentence has a *grammatical
function* that can be useful for understanding the meaning of language. For
example, "saw" in this example is a past tense of the verb "to see". But any
given word might have different meanings in different contexts: "saw" could just
as well be a noun (e.g., a saw used for cutting) or a present tense verb (using
a saw to cut something).
A logical first step in understanding language is figuring out these roles for
each word in the sentence. This process is called *Part-of-Speech (POS)
Tagging*. The roles are called POS tags. Although a given word might have
multiple possible tags depending on the context, given any one interpretation of
a sentence each word will generally only have one tag.
One interesting challenge of POS tagging is that the problem of defining a
vocabulary of POS tags for a given language is quite involved. While the concept
of nouns and verbs is pretty common, it has been traditionally difficult to
agree on a standard set of roles across all languages. The [Universal
Dependencies](http://www.universaldependencies.org) project aims to solve this
problem.
### Training the SyntaxNet POS Tagger
In general, determining the correct POS tag requires understanding the entire
sentence and the context in which it is uttered. In practice, we can do very
well just by considering a small window of words around the word of interest.
For example, words that follow the word ‘the’ tend to be adjectives or nouns,
rather than verbs.
To predict POS tags, we use a simple setup. We process the sentences
left-to-right. For any given word, we extract features of that word and a window
around it, and use these as inputs to a feed-forward neural network classifier,
which predicts a probability distribution over POS tags. Because we make
decisions in left-to-right order, we also use prior decisions as features in
subsequent ones (e.g. "the previous predicted tag was a noun").
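This window-and-history setup can be sketched in a few lines of Python. This is a toy illustration only: the feature names, window size, and padding token below are ours, not the package's actual feature extraction.

```python
def window_features(words, i, prev_tags, window=1):
    """Collect simple features for tagging words[i], given the tags
    already predicted for words[0..i-1] (left-to-right processing)."""
    feats = {}
    # Words in a small window around the word of interest.
    for offset in range(-window, window + 1):
        j = i + offset
        word = words[j] if 0 <= j < len(words) else "<PAD>"
        feats["word(%+d)" % offset] = word.lower()
    # Prior decisions become features for subsequent ones.
    feats["prev_tag"] = prev_tags[i - 1] if i > 0 else "<START>"
    feats["suffix(2)"] = words[i][-2:].lower()
    return feats

# Tagging "the" in "I saw the man", having already tagged "I" and "saw":
feats = window_features(["I", "saw", "the", "man"], 2, ["PRP", "VBD"])
# feats["prev_tag"] == "VBD", feats["word(+1)"] == "man"
```

In the real model, features like these index into embedding matrices and feed a feed-forward classifier over POS tags.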
All the models in this package use a flexible markup language to define
features. For example, the features in the POS tagger are found in the
`brain_pos_features` parameter in the `TaskSpec`, and look like this (modulo
spacing):
```
stack(3).word stack(2).word stack(1).word stack.word input.word input(1).word input(2).word input(3).word;
input.digit input.hyphen;
stack.suffix(length=2) input.suffix(length=2) input(1).suffix(length=2);
stack.prefix(length=2) input.prefix(length=2) input(1).prefix(length=2)
```
Note that `stack` here means "words we have already tagged." Thus, this feature
spec uses three types of features: words, suffixes, and prefixes. The features
are grouped into blocks that share an embedding matrix, concatenated together,
and fed into a chain of hidden layers. This structure is based upon the model
proposed by [Chen and Manning (2014)](http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf).
We show this layout in the schematic below: the state of the system (a stack and
a buffer, visualized below for both the POS and the dependency parsing task) is
used to extract sparse features, which are fed into the network in groups. We
show only a small subset of the features to simplify the presentation in the
schematic:
![Schematic](ff_nn_schematic.png "Feed-forward Network Structure")
In the configuration above, each block gets its own embedding matrix, and the
blocks are delineated with a semicolon. The
dimensions of each block are controlled in the `brain_pos_embedding_dims`
parameter. **Important note:** unlike many simple NLP models, this is *not* a
bag of words model. Remember that although certain features share embedding
matrices, the above features will be concatenated, so the interpretation of
`input.word` will be quite different from `input(1).word`. This also means that
adding features increases the dimension of the `concat` layer of the model as
well as the number of parameters for the first hidden layer.
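The difference between *sharing an embedding matrix* and *being interchangeable* can be seen in a toy sketch; the embedding values and the two-dimensional embedding size here are invented for illustration:

```python
# Toy illustration: features in one block share an embedding matrix,
# but each feature occupies its own slice of the concatenated input,
# so input.word and input(1).word are not interchangeable.
word_embeddings = {          # one shared matrix for the "words" block
    "the": [0.1, 0.2],
    "man": [0.3, 0.4],
}

def embed_block(features):
    """Look each feature up in the shared matrix and concatenate."""
    concat = []
    for f in features:
        concat.extend(word_embeddings.get(f, [0.0, 0.0]))
    return concat

# input.word = "the", input(1).word = "man", versus the reverse order:
a = embed_block(["the", "man"])   # [0.1, 0.2, 0.3, 0.4]
b = embed_block(["man", "the"])   # [0.3, 0.4, 0.1, 0.2]
assert a != b                     # same features, different network input

# Adding a feature grows the concat layer (and the first hidden layer):
assert len(embed_block(["the", "man", "the"])) == 6
```

This is why the model is not a bag of words: position in the concatenation carries information even when the embedding matrix is shared.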
To train the model, first edit `syntaxnet/context.pbtxt` so that the inputs
`training-corpus`, `tuning-corpus`, and `dev-corpus` point to the location of
your training data. You can then train a part-of-speech tagger with:
```shell
# --arg_prefix=brain_pos reads from the POS configuration;
# --compute_lexicon is required for the first stage of the pipeline;
# --graph_builder=greedy disables beam search;
# --training_corpus/--tuning_corpus name the training and tuning sets;
# --output_path says where to save new resources;
# --batch_size through --seed are hyper-parameters;
# --params is a name for this hyper-parameter combination.
bazel-bin/syntaxnet/parser_trainer \
  --task_context=syntaxnet/context.pbtxt \
  --arg_prefix=brain_pos \
  --compute_lexicon \
  --graph_builder=greedy \
  --training_corpus=training-corpus \
  --tuning_corpus=tuning-corpus \
  --output_path=models \
  --batch_size=32 \
  --decay_steps=3600 \
  --hidden_layer_sizes=128 \
  --learning_rate=0.08 \
  --momentum=0.9 \
  --seed=0 \
  --params=128-0.08-3600-0.9-0
```
This will read in the data, construct a lexicon, build a TensorFlow graph for
the model with the specific hyperparameters, and train the model. Every so often
the model will be evaluated on the tuning set, and only the checkpoint with the
highest accuracy on this set will be saved. **Note that you should never use a
corpus you intend to test your model on as your tuning set, as you will inflate
your test set results.**
For best results, you should repeat this command with at least 3 different
seeds, and possibly with a few different values for `--learning_rate` and
`--decay_steps`. Good values for `--learning_rate` are usually close to 0.1, and
you usually want `--decay_steps` to correspond to about one tenth of your
corpus. The `--params` flag is only a human-readable identifier for the model
being trained, used to construct the full output path, so that you don't need to
worry about clobbering old models by accident.
The `--arg_prefix` flag controls which parameters should be read from the task
context file `context.pbtxt`. In this case `arg_prefix` is set to `brain_pos`,
so the parameters being used in this training run are
`brain_pos_transition_system`, `brain_pos_embedding_dims`, `brain_pos_features`,
and `brain_pos_embedding_names`. To train the dependency parser later,
`arg_prefix` will be set to `brain_parser`.
### Preprocessing with the Tagger
Now that we have a trained POS tagging model, we want to use the output of this
model as features in the parser. Thus the next step is to run the trained model
over our training, tuning, and dev (evaluation) sets. We can use the
`parser_eval.py` script for this.
For example, the model `128-0.08-3600-0.9-0` trained above can be run over the
training, tuning, and dev sets with the following command:
```shell
PARAMS=128-0.08-3600-0.9-0
for SET in training tuning dev; do
  bazel-bin/syntaxnet/parser_eval \
    --task_context=models/brain_pos/greedy/$PARAMS/context \
    --hidden_layer_sizes=128 \
    --input=$SET-corpus \
    --output=tagged-$SET-corpus \
    --arg_prefix=brain_pos \
    --graph_builder=greedy \
    --model_path=models/brain_pos/greedy/$PARAMS/model
done
```
**Important note:** This command only works because we have created entries for
you in `context.pbtxt` that correspond to `tagged-training-corpus`,
`tagged-dev-corpus`, and `tagged-tuning-corpus`. With these default settings,
the above will write tagged versions of the training, tuning, and dev set to the
directory `models/brain_pos/greedy/$PARAMS/`. This location is chosen because
the `input` entries do not have `file_pattern` set: instead, they have `creator:
brain_pos/greedy`, which means that `parser_trainer.py` will construct *new*
files when called with `--arg_prefix=brain_pos --graph_builder=greedy` using the
`--model_path` flag to determine the location.
For convenience, `parser_eval.py` also logs POS tagging accuracy after the
output tagged datasets have been written.
### Dependency Parsing: Transition-Based Parsing
Now that we have a prediction for the grammatical role of the words, we want to
understand how the words in the sentence relate to each other. This parser is
built around the *head-modifier* construction: for each word, we choose a
*syntactic head* that it modifies according to some grammatical role.
An example for the above sentence is as follows:
![Figure](sawman.png)
Below each word in the sentence we see both a fine-grained part-of-speech
(*PRP*, *VBD*, *DT*, *NN* etc.), and a coarse-grained part-of-speech (*PRON*,
*VERB*, *DET*, *NOUN*, etc.). Coarse-grained POS tags encode basic grammatical
categories, while the fine-grained POS tags make further distinctions: for
example *NN* is a singular noun (as opposed, for example, to *NNS*, which is a
plural noun), and *VBD* is a past-tense verb. For more discussion see [Petrov et
al. (2012)](http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf).
Crucially, we also see directed arcs signifying grammatical relationships
between different words in the sentence. For example *I* is the subject of
*saw*, as signified by the directed arc labeled *nsubj* between these words;
*man* is the direct object (dobj) of *saw*; the preposition *with* modifies
*man* with a prep relation, signifying modification by a prepositional phrase;
and so on. In addition, the verb *saw* is identified as the *root* of the entire
sentence.
Whenever we have a directed arc between two words, we refer to the word at the
start of the arc as the *head*, and the word at the end of the arc as the
*modifier*. For example we have one arc where the head is *saw* and the modifier
is *I*, another where the head is *saw* and the modifier is *man*, and so on.
The grammatical relationships encoded in dependency structures are directly
related to the underlying meaning of the sentence in question. They allow us to
easily recover the answers to various questions, for example *whom did I see?*,
*who saw the man with glasses?*, and so on.
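As a small illustration of reading answers off the arcs, the parse below is hand-coded to match the figure above; in practice it would come from the parser's output:

```python
# Hand-coded parse of "I saw the man with glasses ." in the style of
# the figure: each token has a head index (-1 = root) and an arc label.
tokens = ["I", "saw", "the", "man", "with", "glasses", "."]
heads  = [  1,    -1,     3,     1,      3,         4,     1]
labels = ["nsubj", "root", "det", "dobj", "prep", "pobj", "punct"]

def modifier_of(head_word, label):
    """Find the word attached to head_word with the given relation."""
    for i, (h, l) in enumerate(zip(heads, labels)):
        if h != -1 and tokens[h] == head_word and l == label:
            return tokens[i]
    return None

# "Whom did I see?" -> the direct object of "saw":
print(modifier_of("saw", "dobj"))   # man
# "Who saw the man with glasses?" -> the subject of "saw":
print(modifier_of("saw", "nsubj"))  # I
```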
SyntaxNet is a **transition-based** dependency parser
[Nivre (2007)](http://www.mitpressjournals.org/doi/pdfplus/10.1162/coli.07-056-R1-07-027) that
constructs a parse incrementally. Like the tagger, it processes words
left-to-right. The words all start as unprocessed input, called the *buffer*. As
words are encountered they are put onto a *stack*. At each step, the parser can
do one of three things:
1. **SHIFT:** Push another word onto the top of the stack, i.e. shifting one
   token from the buffer to the stack.
1. **LEFT_ARC:** Pop the top two words from the stack. Attach the second to the
   first, creating an arc pointing to the **left**. Push the **first** word
   back on the stack.
1. **RIGHT_ARC:** Pop the top two words from the stack. Attach the first to the
   second, creating an arc pointing to the **right**. Push the **second** word
   back on the stack.
At each step, we call the combination of the stack and the buffer the
*configuration* of the parser. For the left and right actions, we also assign a
dependency relation label to that arc. This process is visualized in the
following animation for a short sentence:
![Animation](looping-parser.gif "Parsing in Action")
Note that this parser is following a sequence of actions, called a
**derivation**, to produce a "gold" tree labeled by a linguist. We can use this
sequence of decisions to learn a classifier that takes a configuration and
predicts the next action to take.
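The transition system can be sketched as a minimal arc-standard loop. This is a simplified Python illustration of the scheme, not the package's actual C++ implementation; dependency labels are omitted:

```python
def parse(n_tokens, actions):
    """Run SHIFT / LEFT_ARC / RIGHT_ARC over token indices 0..n-1.
    Returns head[i] for each token (-1 for the root)."""
    buffer = list(range(n_tokens))      # unprocessed input
    stack, heads = [], [-1] * n_tokens
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":
            first, second = stack.pop(), stack.pop()
            heads[second] = first       # second modifies first
            stack.append(first)
        elif action == "RIGHT_ARC":
            first, second = stack.pop(), stack.pop()
            heads[first] = second       # first modifies second
            stack.append(second)
    return heads

# A derivation for "I saw the man":
# I <-nsubj- saw, the <-det- man, man <-dobj- saw, saw = root.
gold = ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "SHIFT",
        "LEFT_ARC", "RIGHT_ARC"]
print(parse(4, gold))  # [1, -1, 3, 1]
```

A classifier trained on gold derivations predicts the next action from features of the current configuration, exactly as the tagger predicts tags.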
### Training a Parser Step 1: Local Pretraining
As described in our [paper](http://arxiv.org/pdf/1603.06042v1.pdf), the first
step in training the model is to *pre-train* using *local* decisions. In this
phase, we use the gold dependency tree to guide the parser, and train a softmax layer
to predict the correct action given these gold dependencies. This can be
performed very efficiently, since the parser's decisions are all independent in
this setting.
Once the tagged datasets are available, a locally normalized dependency parsing
model can be trained with the following command:
```shell
bazel-bin/syntaxnet/parser_trainer \
  --arg_prefix=brain_parser \
  --batch_size=32 \
  --projectivize_training_set \
  --decay_steps=4400 \
  --graph_builder=greedy \
  --hidden_layer_sizes=200,200 \
  --learning_rate=0.08 \
  --momentum=0.85 \
  --output_path=models \
  --task_context=models/brain_pos/greedy/$PARAMS/context \
  --seed=4 \
  --training_corpus=tagged-training-corpus \
  --tuning_corpus=tagged-tuning-corpus \
  --params=200x200-0.08-4400-0.85-4
```
Note that we point the trainer to the context corresponding to the POS tagger
that we picked previously. This allows the parser to reuse the lexicons and the
tagged datasets that were created in the previous steps. Processing data can be
done similarly to how tagging was done above. For example, if we picked the
parameters `200x200-0.08-4400-0.85-4`, the training, tuning and dev sets
can be parsed with the following command:
```shell
PARAMS=200x200-0.08-4400-0.85-4
for SET in training tuning dev; do
  bazel-bin/syntaxnet/parser_eval \
    --task_context=models/brain_parser/greedy/$PARAMS/context \
    --hidden_layer_sizes=200,200 \
    --input=tagged-$SET-corpus \
    --output=parsed-$SET-corpus \
    --arg_prefix=brain_parser \
    --graph_builder=greedy \
    --model_path=models/brain_parser/greedy/$PARAMS/model
done
```
### Training a Parser Step 2: Global Training
As we describe in the paper, there are several problems with the locally
normalized models we just trained. The most important is the *label-bias*
problem: the model doesn't learn what a good parse looks like, only what action
to take given a history of gold decisions. This is because the scores are
normalized *locally* using a softmax for each decision.
In the paper, we show how we can achieve much better results using a *globally*
normalized model: in this model, the softmax scores are summed in log space, and
the scores are not normalized until we reach a final decision. When the parser
stops, the scores of each hypothesis are normalized against a small set of
possible parses (in the case of this model, a beam size of 8). When training, we
stop the parser as soon as the gold derivation falls off the beam (a strategy
known as early updates).
We give a simplified view of how this training works for a [garden path
sentence](https://en.wikipedia.org/wiki/Garden_path_sentence), where it is
important to maintain multiple hypotheses. A single mistake early on in parsing
leads to a completely incorrect parse; after training, the model learns to
prefer the second (correct) parse.
![Beam search training](beam_search_training.png)
Parsey McParseface correctly parses this sentence. Even though the correct parse
is initially ranked 4th out of multiple hypotheses, when the end of the garden
path is reached, Parsey McParseface can recover thanks to the beam; using a
larger beam yields a more accurate model, but it is slower (we used beam 32 for
the models in the paper).
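Numerically, the effect of global normalization can be sketched with toy scores; the per-decision scores and the beam of two hypotheses below are invented for illustration:

```python
import math

def log_softmax(scores):
    """Log-normalize a list of raw scores."""
    z = math.log(sum(math.exp(s) for s in scores))
    return [s - z for s in scores]

# Two surviving hypotheses, each a sequence of raw per-decision scores.
hyp_a = [2.0, 0.5, 1.5]  # e.g. the garden-path reading: strong start
hyp_b = [1.0, 1.0, 3.0]  # the correct reading: weak early, strong late

# Globally normalized: sum scores in log space along each hypothesis,
# and normalize only once, over the final beam.
totals = [sum(hyp_a), sum(hyp_b)]               # [4.0, 5.0]
probs = [math.exp(s) for s in log_softmax(totals)]
# hyp_b wins overall even though hyp_a scored higher on the first
# decision: late evidence can overturn an early preference, which a
# greedy, locally normalized model could never revisit.
```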
Once you have the pre-trained locally normalized model, a globally normalized
parsing model can now be trained with the following command:
```shell
bazel-bin/syntaxnet/parser_trainer \
  --arg_prefix=brain_parser \
  --batch_size=8 \
  --decay_steps=100 \
  --graph_builder=structured \
  --hidden_layer_sizes=200,200 \
  --learning_rate=0.02 \
  --momentum=0.9 \
  --output_path=models \
  --task_context=models/brain_parser/greedy/$PARAMS/context \
  --seed=0 \
  --training_corpus=projectivized-training-corpus \
  --tuning_corpus=tagged-tuning-corpus \
  --params=200x200-0.02-100-0.9-0 \
  --pretrained_params=models/brain_parser/greedy/$PARAMS/model \
  --pretrained_params_names=\
embedding_matrix_0,embedding_matrix_1,embedding_matrix_2,\
bias_0,weights_0,bias_1,weights_1
```
Training a beam model with the structured builder will take a lot longer than
the greedy training runs above, perhaps 3 or 4 times longer. Note once again
that multiple restarts of training will yield the most reliable results.
Evaluation can again be done with `parser_eval.py`. In this case we use
parameters `200x200-0.02-100-0.9-0` to evaluate on the training, tuning and dev
sets with the following command:
```shell
PARAMS=200x200-0.02-100-0.9-0
for SET in training tuning dev; do
  bazel-bin/syntaxnet/parser_eval \
    --task_context=models/brain_parser/structured/$PARAMS/context \
    --hidden_layer_sizes=200,200 \
    --input=tagged-$SET-corpus \
    --output=beam-parsed-$SET-corpus \
    --arg_prefix=brain_parser \
    --graph_builder=structured \
    --model_path=models/brain_parser/structured/$PARAMS/model
done
```
Hooray! You now have your very own cousin of Parsey McParseface, ready to go out
and parse text in the wild.
## Contact
To ask questions or report issues please contact syntaxnet-users@google.com.
## Credits
Original authors of the code in this package include (in alphabetical order):
* apresta@google.com (Alessandro Presta)
* bohnetbd@google.com (Bernd Bohnet)
* chrisalberti@google.com (Chris Alberti)
* credo@google.com (Tim Credo)
* danielandor@google.com (Daniel Andor)
* djweiss@google.com (David Weiss)
* epitler@google.com (Emily Pitler)
* gcoppola@google.com (Greg Coppola)
* golding@google.com (Andy Golding)
* istefan@google.com (Stefan Istrate)
* kbhall@google.com (Keith Hall)
* kuzman@google.com (Kuzman Ganchev)
* mjcollins@google.com (Michael Collins)
* ringgaard@google.com (Michael Ringgaard)
* ryanmcd@google.com (Ryan McDonald)
* severyn@google.com (Aliaksei Severyn)
* slav@google.com (Slav Petrov)
* terrykoo@google.com (Terry Koo)
local_repository(
    name = "tf",
    path = __workspace_dir__ + "/tensorflow",
)

load('//tensorflow/tensorflow:workspace.bzl', 'tf_workspace')
tf_workspace("tensorflow/", "@tf")

# Specify the minimum required Bazel version.
load("@tf//tensorflow:tensorflow.bzl", "check_version")
check_version("0.2.0")

# ===== gRPC dependencies =====
bind(
    name = "libssl",
    actual = "@boringssl_git//:ssl",
)

git_repository(
    name = "boringssl_git",
    commit = "436432d849b83ab90f18773e4ae1c7a8f148f48d",
    init_submodules = True,
    remote = "https://github.com/mdsteele/boringssl-bazel.git",
)

bind(
    name = "zlib",
    actual = "@zlib_archive//:zlib",
)

new_http_archive(
    name = "zlib_archive",
    build_file = "zlib.BUILD",
    sha256 = "879d73d8cd4d155f31c1f04838ecd567d34bebda780156f0e82a20721b3973d5",
    strip_prefix = "zlib-1.2.8",
    url = "http://zlib.net/zlib128.zip",
)
# Description:
# A syntactic parser and part-of-speech tagger in TensorFlow.

package(
    default_visibility = ["//visibility:private"],
    features = ["-layering_check"],
)

licenses(["notice"])  # Apache 2.0

load(
    "syntaxnet",
    "tf_proto_library",
    "tf_proto_library_py",
    "tf_gen_op_libs",
    "tf_gen_op_wrapper_py",
)

# proto libraries

tf_proto_library(
    name = "feature_extractor_proto",
    srcs = ["feature_extractor.proto"],
)

tf_proto_library(
    name = "sentence_proto",
    srcs = ["sentence.proto"],
)

tf_proto_library_py(
    name = "sentence_py_pb2",
    srcs = ["sentence.proto"],
)

tf_proto_library(
    name = "dictionary_proto",
    srcs = ["dictionary.proto"],
)

tf_proto_library_py(
    name = "dictionary_py_pb2",
    srcs = ["dictionary.proto"],
)

tf_proto_library(
    name = "kbest_syntax_proto",
    srcs = ["kbest_syntax.proto"],
    deps = [":sentence_proto"],
)

tf_proto_library(
    name = "task_spec_proto",
    srcs = ["task_spec.proto"],
)

tf_proto_library_py(
    name = "task_spec_py_pb2",
    srcs = ["task_spec.proto"],
)

tf_proto_library(
    name = "sparse_proto",
    srcs = ["sparse.proto"],
)

tf_proto_library_py(
    name = "sparse_py_pb2",
    srcs = ["sparse.proto"],
)

# cc libraries for feature extraction and parsing

cc_library(
    name = "base",
    hdrs = ["base.h"],
    visibility = ["//visibility:public"],
    deps = [
        "@re2//:re2",
        "@tf//google/protobuf",
        "@tf//third_party/eigen3",
    ] + select({
        "//conditions:default": [
            "@tf//tensorflow/core:framework",
            "@tf//tensorflow/core:lib",
        ],
        "@tf//tensorflow:darwin": [
            "@tf//tensorflow/core:framework_headers_lib",
        ],
    }),
)

cc_library(
    name = "utils",
    srcs = ["utils.cc"],
    hdrs = [
        "utils.h",
    ],
    deps = [
        ":base",
        "//util/utf8:unicodetext",
    ],
)

cc_library(
    name = "test_main",
    testonly = 1,
    srcs = ["test_main.cc"],
    linkopts = ["-lm"],
    deps = [
        "@tf//tensorflow/core:lib",
        "@tf//tensorflow/core:testlib",
        "//external:gtest",
    ],
)

cc_library(
    name = "document_format",
    srcs = ["document_format.cc"],
    hdrs = ["document_format.h"],
    deps = [
        ":registry",
        ":sentence_proto",
        ":task_context",
    ],
)

cc_library(
    name = "text_formats",
    srcs = ["text_formats.cc"],
    deps = [
        ":document_format",
    ],
    alwayslink = 1,
)

cc_library(
    name = "fml_parser",
    srcs = ["fml_parser.cc"],
    hdrs = ["fml_parser.h"],
    deps = [
        ":feature_extractor_proto",
        ":utils",
    ],
)

cc_library(
    name = "proto_io",
    hdrs = ["proto_io.h"],
    deps = [
        ":feature_extractor_proto",
        ":fml_parser",
        ":kbest_syntax_proto",
        ":sentence_proto",
        ":task_context",
    ],
)

cc_library(
    name = "feature_extractor",
    srcs = ["feature_extractor.cc"],
    hdrs = [
        "feature_extractor.h",
        "feature_types.h",
    ],
    deps = [
        ":document_format",
        ":feature_extractor_proto",
        ":kbest_syntax_proto",
        ":proto_io",
        ":sentence_proto",
        ":task_context",
        ":utils",
        ":workspace",
    ],
)

cc_library(
    name = "affix",
    srcs = ["affix.cc"],
    hdrs = ["affix.h"],
    deps = [
        ":dictionary_proto",
        ":feature_extractor",
        ":shared_store",
        ":term_frequency_map",
        ":utils",
        ":workspace",
    ],
)

cc_library(
    name = "sentence_features",
    srcs = ["sentence_features.cc"],
    hdrs = ["sentence_features.h"],
    deps = [
        ":affix",
        ":feature_extractor",
        ":registry",
    ],
)

cc_library(
    name = "shared_store",
    srcs = ["shared_store.cc"],
    hdrs = ["shared_store.h"],
    deps = [
        ":utils",
    ],
)

cc_library(
    name = "registry",
    srcs = ["registry.cc"],
    hdrs = ["registry.h"],
    deps = [
        ":utils",
    ],
)

cc_library(
    name = "workspace",
    srcs = ["workspace.cc"],
    hdrs = ["workspace.h"],
    deps = [
        ":utils",
    ],
)

cc_library(
    name = "task_context",
    srcs = ["task_context.cc"],
    hdrs = ["task_context.h"],
    deps = [
        ":task_spec_proto",
        ":utils",
    ],
)

cc_library(
    name = "term_frequency_map",
    srcs = ["term_frequency_map.cc"],
    hdrs = ["term_frequency_map.h"],
    visibility = ["//visibility:public"],
    deps = [
        ":utils",
    ],
    alwayslink = 1,
)

cc_library(
    name = "parser_transitions",
    srcs = [
        "arc_standard_transitions.cc",
        "parser_state.cc",
        "parser_transitions.cc",
        "tagger_transitions.cc",
    ],
    hdrs = [
        "parser_state.h",
        "parser_transitions.h",
    ],
    deps = [
        ":kbest_syntax_proto",
        ":registry",
        ":shared_store",
        ":task_context",
        ":term_frequency_map",
    ],
    alwayslink = 1,
)

cc_library(
    name = "populate_test_inputs",
    testonly = 1,
    srcs = ["populate_test_inputs.cc"],
    hdrs = ["populate_test_inputs.h"],
    deps = [
        ":dictionary_proto",
        ":sentence_proto",
        ":task_context",
        ":term_frequency_map",
        ":test_main",
    ],
)

cc_library(
    name = "parser_features",
    srcs = ["parser_features.cc"],
    hdrs = ["parser_features.h"],
    deps = [
        ":affix",
        ":feature_extractor",
        ":parser_transitions",
        ":registry",
        ":sentence_features",
        ":sentence_proto",
        ":task_context",
        ":term_frequency_map",
        ":workspace",
    ],
    alwayslink = 1,
)

cc_library(
    name = "embedding_feature_extractor",
    srcs = ["embedding_feature_extractor.cc"],
    hdrs = ["embedding_feature_extractor.h"],
    deps = [
        ":feature_extractor",
        ":parser_features",
        ":parser_transitions",
        ":sparse_proto",
        ":task_context",
        ":workspace",
    ],
)

cc_library(
    name = "sentence_batch",
    srcs = ["sentence_batch.cc"],
    hdrs = ["sentence_batch.h"],
    deps = [
        ":embedding_feature_extractor",
        ":feature_extractor",
        ":parser_features",
        ":parser_transitions",
        ":sparse_proto",
        ":task_context",
        ":task_spec_proto",
        ":term_frequency_map",
        ":workspace",
    ],
)

cc_library(
    name = "reader_ops",
    srcs = [
        "beam_reader_ops.cc",
        "reader_ops.cc",
    ],
    deps = [
        ":parser_features",
        ":parser_transitions",
        ":sentence_batch",
        ":sentence_proto",
        ":task_context",
        ":task_spec_proto",
alwayslink = 1,
)
cc_library(
name = "document_filters",
srcs = ["document_filters.cc"],
deps = [
":document_format",
":parser_features",
":parser_transitions",
":sentence_batch",
":sentence_proto",
":task_context",
":task_spec_proto",
":text_formats",
],
alwayslink = 1,
)
cc_library(
name = "lexicon_builder",
srcs = ["lexicon_builder.cc"],
deps = [
":document_format",
":parser_features",
":parser_transitions",
":sentence_batch",
":sentence_proto",
":task_context",
":task_spec_proto",
":text_formats",
],
alwayslink = 1,
)
cc_library(
name = "unpack_sparse_features",
srcs = ["unpack_sparse_features.cc"],
deps = [
":sparse_proto",
":utils",
],
)
cc_library(
name = "parser_ops_cc",
srcs = ["ops/parser_ops.cc"],
deps = [
":base",
":document_filters",
":lexicon_builder",
":reader_ops",
":unpack_sparse_features",
],
alwayslink = 1,
)
cc_binary(
name = "parser_ops.so",
linkopts = select({
"//conditions:default": ["-lm"],
"@tf//tensorflow:darwin": [],
}),
linkshared = 1,
linkstatic = 1,
deps = [
":parser_ops_cc",
],
)
# cc tests
filegroup(
name = "testdata",
srcs = [
"testdata/context.pbtxt",
"testdata/document",
"testdata/mini-training-set",
],
)
cc_test(
name = "shared_store_test",
size = "small",
srcs = ["shared_store_test.cc"],
deps = [
":shared_store",
":test_main",
],
)
cc_test(
name = "sentence_features_test",
size = "medium",
srcs = ["sentence_features_test.cc"],
deps = [
":feature_extractor",
":populate_test_inputs",
":sentence_features",
":sentence_proto",
":task_context",
":task_spec_proto",
":term_frequency_map",
":test_main",
":workspace",
],
)
cc_test(
name = "arc_standard_transitions_test",
size = "small",
srcs = ["arc_standard_transitions_test.cc"],
data = [":testdata"],
deps = [
":parser_transitions",
":populate_test_inputs",
":test_main",
],
)
cc_test(
name = "tagger_transitions_test",
size = "small",
srcs = ["tagger_transitions_test.cc"],
data = [":testdata"],
deps = [
":parser_transitions",
":populate_test_inputs",
":test_main",
],
)
cc_test(
name = "parser_features_test",
size = "small",
srcs = ["parser_features_test.cc"],
deps = [
":feature_extractor",
":parser_features",
":parser_transitions",
":populate_test_inputs",
":sentence_proto",
":task_context",
":task_spec_proto",
":term_frequency_map",
":test_main",
":workspace",
],
)
# py graph builder and trainer
tf_gen_op_libs(
op_lib_names = ["parser_ops"],
)
tf_gen_op_wrapper_py(
name = "parser_ops",
deps = [":parser_ops_op_lib"],
)
py_library(
name = "load_parser_ops_py",
srcs = ["load_parser_ops.py"],
data = [":parser_ops.so"],
)
py_library(
name = "graph_builder",
srcs = ["graph_builder.py"],
deps = [
"@tf//tensorflow:tensorflow_py",
"@tf//tensorflow/core:protos_all_py",
":load_parser_ops_py",
":parser_ops",
],
)
py_library(
name = "structured_graph_builder",
srcs = ["structured_graph_builder.py"],
deps = [
":graph_builder",
],
)
py_binary(
name = "parser_trainer",
srcs = ["parser_trainer.py"],
deps = [
":graph_builder",
":structured_graph_builder",
":task_spec_py_pb2",
],
)
py_binary(
name = "parser_eval",
srcs = ["parser_eval.py"],
deps = [
":graph_builder",
":sentence_py_pb2",
":structured_graph_builder",
],
)
py_binary(
name = "conll2tree",
srcs = ["conll2tree.py"],
deps = [
":graph_builder",
":sentence_py_pb2",
],
)
# py tests
py_test(
name = "lexicon_builder_test",
size = "small",
srcs = ["lexicon_builder_test.py"],
deps = [
":graph_builder",
":sentence_py_pb2",
":task_spec_py_pb2",
],
)
py_test(
name = "text_formats_test",
size = "small",
srcs = ["text_formats_test.py"],
deps = [
":graph_builder",
":sentence_py_pb2",
":task_spec_py_pb2",
],
)
py_test(
name = "reader_ops_test",
size = "medium",
srcs = ["reader_ops_test.py"],
data = [":testdata"],
tags = ["notsan"],
deps = [
":dictionary_py_pb2",
":graph_builder",
":sparse_py_pb2",
],
)
py_test(
name = "beam_reader_ops_test",
size = "medium",
srcs = ["beam_reader_ops_test.py"],
data = [":testdata"],
tags = ["notsan"],
deps = [
":structured_graph_builder",
],
)
py_test(
name = "graph_builder_test",
size = "medium",
srcs = ["graph_builder_test.py"],
data = [
":testdata",
],
tags = ["notsan"],
deps = [
":graph_builder",
":sparse_py_pb2",
],
)
sh_test(
name = "parser_trainer_test",
size = "medium",
srcs = ["parser_trainer_test.sh"],
data = [
":parser_eval",
":parser_trainer",
":testdata",
],
tags = ["notsan"],
)
/* Copyright 2016 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include "syntaxnet/affix.h"
#include <ctype.h>
#include <string.h>
#include <functional>
#include <string>
#include "syntaxnet/shared_store.h"
#include "syntaxnet/task_context.h"
#include "syntaxnet/term_frequency_map.h"
#include "syntaxnet/utils.h"
#include "syntaxnet/workspace.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/regexp.h"
#include "util/utf8/unicodetext.h"
namespace syntaxnet {
// Initial number of buckets in term and affix hash maps. This must be a power
// of two.
static const int kInitialBuckets = 1024;
// Fill factor for term and affix hash maps.
static const int kFillFactor = 2;
int TermHash(const string &term) {
return utils::Hash32(term.data(), term.size(), 0xDECAF);
}
// Copies a substring of a Unicode text to a string.
static void UnicodeSubstring(UnicodeText::const_iterator start,
UnicodeText::const_iterator end, string *result) {
result->clear();
result->append(start.utf8_data(), end.utf8_data() - start.utf8_data());
}
AffixTable::AffixTable(Type type, int max_length) {
type_ = type;
max_length_ = max_length;
Resize(0);
}
AffixTable::~AffixTable() { Reset(0); }
void AffixTable::Reset(int max_length) {
// Save new maximum affix length.
max_length_ = max_length;
// Delete all data.
for (size_t i = 0; i < affixes_.size(); ++i) delete affixes_[i];
affixes_.clear();
buckets_.clear();
Resize(0);
}
void AffixTable::Read(const AffixTableEntry &table_entry) {
CHECK_EQ(table_entry.type(), type_ == PREFIX ? "PREFIX" : "SUFFIX");
CHECK_GE(table_entry.max_length(), 0);
Reset(table_entry.max_length());
// First, create all affixes.
for (int affix_id = 0; affix_id < table_entry.affix_size(); ++affix_id) {
const auto &affix_entry = table_entry.affix(affix_id);
CHECK_GE(affix_entry.length(), 0);
CHECK_LE(affix_entry.length(), max_length_);
CHECK(FindAffix(affix_entry.form()) == NULL); // forbid duplicates
Affix *affix = AddNewAffix(affix_entry.form(), affix_entry.length());
CHECK_EQ(affix->id(), affix_id);
}
CHECK_EQ(affixes_.size(), table_entry.affix_size());
// Next, link the shorter affixes.
for (int affix_id = 0; affix_id < table_entry.affix_size(); ++affix_id) {
const auto &affix_entry = table_entry.affix(affix_id);
if (affix_entry.shorter_id() == -1) {
CHECK_EQ(affix_entry.length(), 1);
continue;
}
CHECK_GT(affix_entry.length(), 1);
CHECK_GE(affix_entry.shorter_id(), 0);
CHECK_LT(affix_entry.shorter_id(), affixes_.size());
Affix *affix = affixes_[affix_id];
Affix *shorter = affixes_[affix_entry.shorter_id()];
CHECK_EQ(affix->length(), shorter->length() + 1);
affix->set_shorter(shorter);
}
}
void AffixTable::Read(ProtoRecordReader *reader) {
AffixTableEntry table_entry;
TF_CHECK_OK(reader->Read(&table_entry));
Read(table_entry);
}
void AffixTable::Write(AffixTableEntry *table_entry) const {
table_entry->Clear();
table_entry->set_type(type_ == PREFIX ? "PREFIX" : "SUFFIX");
table_entry->set_max_length(max_length_);
for (const Affix *affix : affixes_) {
auto *affix_entry = table_entry->add_affix();
affix_entry->set_form(affix->form());
affix_entry->set_length(affix->length());
affix_entry->set_shorter_id(
affix->shorter() == NULL ? -1 : affix->shorter()->id());
}
}
void AffixTable::Write(ProtoRecordWriter *writer) const {
AffixTableEntry table_entry;
Write(&table_entry);
writer->Write(table_entry);
}
Affix *AffixTable::AddAffixesForWord(const char *word, size_t size) {
// Affix lengths are measured in Unicode characters rather than bytes, so
// first determine the length of the word in characters.
UnicodeText text;
text.PointToUTF8(word, size);
int length = text.size();
// Determine longest affix.
int affix_len = length;
if (affix_len > max_length_) affix_len = max_length_;
if (affix_len == 0) return NULL;
// Find start and end of longest affix.
UnicodeText::const_iterator start, end;
if (type_ == PREFIX) {
start = end = text.begin();
for (int i = 0; i < affix_len; ++i) ++end;
} else {
start = end = text.end();
for (int i = 0; i < affix_len; ++i) --start;
}
// Try to find successively shorter affixes.
Affix *top = NULL;
Affix *ancestor = NULL;
string s;
while (affix_len > 0) {
// Try to find affix in table.
UnicodeSubstring(start, end, &s);
Affix *affix = FindAffix(s);
if (affix == NULL) {
// Affix not found, add new one to table.
affix = AddNewAffix(s, affix_len);
// Update ancestor chain.
if (ancestor != NULL) ancestor->set_shorter(affix);
ancestor = affix;
if (top == NULL) top = affix;
} else {
// Affix found. Update ancestor if needed and return match.
if (ancestor != NULL) ancestor->set_shorter(affix);
if (top == NULL) top = affix;
break;
}
// Next affix.
if (type_ == PREFIX) {
--end;
} else {
++start;
}
affix_len--;
}
return top;
}
Affix *AffixTable::GetAffix(int id) const {
if (id < 0 || id >= static_cast<int>(affixes_.size())) {
return NULL;
} else {
return affixes_[id];
}
}
string AffixTable::AffixForm(int id) const {
Affix *affix = GetAffix(id);
if (affix == NULL) {
return "";
} else {
return affix->form();
}
}
int AffixTable::AffixId(const string &form) const {
Affix *affix = FindAffix(form);
if (affix == NULL) {
return -1;
} else {
return affix->id();
}
}
Affix *AffixTable::AddNewAffix(const string &form, int length) {
int hash = TermHash(form);
int id = affixes_.size();
if (id > static_cast<int>(buckets_.size()) * kFillFactor) Resize(id);
int b = hash & (buckets_.size() - 1);
// Create new affix object.
Affix *affix = new Affix(id, form.c_str(), length);
affixes_.push_back(affix);
// Insert affix in bucket chain.
affix->next_ = buckets_[b];
buckets_[b] = affix;
return affix;
}
Affix *AffixTable::FindAffix(const string &form) const {
// Compute hash value for word.
int hash = TermHash(form);
// Try to find affix in hash table.
Affix *affix = buckets_[hash & (buckets_.size() - 1)];
while (affix != NULL) {
if (strcmp(affix->form_.c_str(), form.c_str()) == 0) return affix;
affix = affix->next_;
}
return NULL;
}
void AffixTable::Resize(int size_hint) {
// Compute new size for bucket array.
int new_size = kInitialBuckets;
while (new_size < size_hint) new_size *= 2;
int mask = new_size - 1;
// Distribute affixes in new buckets.
buckets_.resize(new_size);
for (size_t i = 0; i < buckets_.size(); ++i) {
buckets_[i] = NULL;
}
for (size_t i = 0; i < affixes_.size(); ++i) {
Affix *affix = affixes_[i];
int b = TermHash(affix->form_) & mask;
affix->next_ = buckets_[b];
buckets_[b] = affix;
}
}
} // namespace syntaxnet
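The shorter-affix chain that `AddAffixesForWord` maintains can be sketched in isolation. This is an illustrative stand-in only: `MiniAffix` and `AddSuffixes` are simplified, ASCII-only versions of the Unicode-aware `Affix`/`AffixTable` machinery above, and always relink the full chain instead of stopping at the first known affix.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Minimal sketch of the shorter-affix chain: every suffix of a word (up to
// a maximum length) is stored once and points at the suffix that is one
// character shorter, e.g. "ars" -> "rs" -> "s" for the word "cars".
struct MiniAffix {
  std::string form;
  MiniAffix *shorter = nullptr;  // next-shorter affix in the chain
};

// Adds all suffixes of `word` up to `max_length` characters and returns the
// longest one (the head of the chain). Uses a std::map so node addresses
// stay stable as the table grows.
MiniAffix *AddSuffixes(const std::string &word, int max_length,
                       std::map<std::string, MiniAffix> *table) {
  int len = std::min(static_cast<int>(word.size()), max_length);
  MiniAffix *top = nullptr;
  MiniAffix *prev = nullptr;
  for (int l = len; l > 0; --l) {
    std::string form = word.substr(word.size() - l);
    MiniAffix &affix = (*table)[form];
    affix.form = form;
    if (prev != nullptr) prev->shorter = &affix;  // link from longer affix
    if (top == nullptr) top = &affix;
    prev = &affix;
  }
  return top;
}
```

Following the `shorter` pointers from the returned affix yields exactly the successively shorter suffixes that the feature extractors above consume.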
/* Copyright 2016 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#ifndef SYNTAXNET_AFFIX_H_
#define SYNTAXNET_AFFIX_H_
#include <stddef.h>
#include <string>
#include <vector>
#include "syntaxnet/utils.h"
#include "syntaxnet/dictionary.pb.h"
#include "syntaxnet/feature_extractor.h"
#include "syntaxnet/proto_io.h"
#include "syntaxnet/sentence.pb.h"
#include "syntaxnet/task_context.h"
#include "syntaxnet/term_frequency_map.h"
#include "syntaxnet/workspace.h"
#include "tensorflow/core/lib/strings/strcat.h"
namespace syntaxnet {
// An affix represents a prefix or suffix of a word of a certain length. Each
// affix has a unique id and a textual form. An affix also has a pointer to the
// affix that is one character shorter. This creates a chain of affixes that are
// successively shorter.
class Affix {
private:
friend class AffixTable;
Affix(int id, const char *form, int length)
: id_(id), length_(length), form_(form), shorter_(NULL), next_(NULL) {}
public:
// Returns unique id of affix.
int id() const { return id_; }
// Returns the textual representation of the affix.
string form() const { return form_; }
// Returns the length of the affix.
int length() const { return length_; }
// Gets/sets the affix that is one character shorter.
Affix *shorter() const { return shorter_; }
void set_shorter(Affix *next) { shorter_ = next; }
private:
// Affix id.
int id_;
// Length (in characters) of affix.
int length_;
// Text form of affix.
string form_;
// Pointer to affix that is one character shorter.
Affix *shorter_;
// Next affix in bucket chain.
Affix *next_;
TF_DISALLOW_COPY_AND_ASSIGN(Affix);
};
// An affix table holds all prefixes/suffixes of all the words added to the
// table up to a maximum length. The affixes are chained together to enable
// fast lookup of all affixes for a word.
class AffixTable {
public:
// Affix table type.
enum Type { PREFIX, SUFFIX };
AffixTable(Type type, int max_length);
~AffixTable();
// Resets the affix table and initializes it for affixes of up to the
// specified maximum length.
void Reset(int max_length);
// De-serializes this from the given proto.
void Read(const AffixTableEntry &table_entry);
// De-serializes this from the given records.
void Read(ProtoRecordReader *reader);
// Serializes this to the given proto.
void Write(AffixTableEntry *table_entry) const;
// Serializes this to the given records.
void Write(ProtoRecordWriter *writer) const;
// Adds all prefixes/suffixes of the word up to the maximum length to the
// table. The longest affix is returned. The pointers in the affix can be
// used for getting shorter affixes.
Affix *AddAffixesForWord(const char *word, size_t size);
// Gets the affix information for the affix with a certain id. Returns NULL if
// there is no affix in the table with this id.
Affix *GetAffix(int id) const;
// Gets affix form from id. If the affix does not exist in the table, an empty
// string is returned.
string AffixForm(int id) const;
// Gets affix id for affix. If the affix does not exist in the table, -1 is
// returned.
int AffixId(const string &form) const;
// Returns size of the affix table.
int size() const { return affixes_.size(); }
// Returns the maximum affix length.
int max_length() const { return max_length_; }
private:
// Adds a new affix to table.
Affix *AddNewAffix(const string &form, int length);
// Finds existing affix in table.
Affix *FindAffix(const string &form) const;
// Resizes bucket array.
void Resize(int size_hint);
// Affix type (prefix or suffix).
Type type_;
// Maximum length of affix.
int max_length_;
// Index from affix ids to affix items.
vector<Affix *> affixes_;
// Buckets for word-to-affix hash map.
vector<Affix *> buckets_;
TF_DISALLOW_COPY_AND_ASSIGN(AffixTable);
};
} // namespace syntaxnet
#endif  // SYNTAXNET_AFFIX_H_
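`FindAffix` and `AddNewAffix` select buckets with `hash & (buckets_.size() - 1)`, which is why `kInitialBuckets` must be a power of two and `Resize` only ever doubles: masking with `size - 1` is then equivalent to taking the hash modulo the bucket count, without a division. A minimal illustration (the `Bucket` helper is hypothetical, not part of the library):

```cpp
#include <cassert>

// With a power-of-two bucket count, masking with (num_buckets - 1) selects
// a bucket exactly like hash % num_buckets, but avoids a division.
int Bucket(unsigned int hash, int num_buckets) {
  // The mask trick is only valid when num_buckets is a power of two.
  assert(num_buckets > 0 && (num_buckets & (num_buckets - 1)) == 0);
  return hash & (num_buckets - 1);
}
```

Because 1024 is a power of two, `Bucket(h, 1024)` and `h % 1024` agree for every hash value `h`.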
/* Copyright 2016 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
// Arc-standard transition system.
//
// This transition system has three types of actions:
// - The SHIFT action pushes the next input token to the stack and
// advances to the next input token.
// - The LEFT_ARC action adds a dependency relation from first to second token
// on the stack and removes second one.
// - The RIGHT_ARC action adds a dependency relation from second to first token
// on the stack and removes the first one.
//
// The transition system operates with parser actions encoded as integers:
// - A SHIFT action is encoded as 0.
// - A LEFT_ARC action is encoded as an odd number starting from 1.
// - A RIGHT_ARC action is encoded as an even number starting from 2.
#include <string>
#include "syntaxnet/utils.h"
#include "syntaxnet/parser_state.h"
#include "syntaxnet/parser_transitions.h"
#include "tensorflow/core/lib/strings/strcat.h"
namespace syntaxnet {
class ArcStandardTransitionState : public ParserTransitionState {
public:
// Clones the transition state by returning a new object.
ParserTransitionState *Clone() const override {
return new ArcStandardTransitionState();
}
// Pushes the root on the stack before using the parser state in parsing.
void Init(ParserState *state) override { state->Push(-1); }
// Adds transition state specific annotations to the document.
void AddParseToDocument(const ParserState &state, bool rewrite_root_labels,
Sentence *sentence) const override {
for (int i = 0; i < state.NumTokens(); ++i) {
Token *token = sentence->mutable_token(i);
token->set_label(state.LabelAsString(state.Label(i)));
if (state.Head(i) != -1) {
token->set_head(state.Head(i));
} else {
token->clear_head();
if (rewrite_root_labels) {
token->set_label(state.LabelAsString(state.RootLabel()));
}
}
}
}
// Whether a parsed token should be considered correct for evaluation.
bool IsTokenCorrect(const ParserState &state, int index) const override {
return state.GoldHead(index) == state.Head(index);
}
// Returns a human readable string representation of this state.
string ToString(const ParserState &state) const override {
string str;
str.append("[");
for (int i = state.StackSize() - 1; i >= 0; --i) {
const string &word = state.GetToken(state.Stack(i)).word();
if (i != state.StackSize() - 1) str.append(" ");
if (word == "") {
str.append(ParserState::kRootLabel);
} else {
str.append(word);
}
}
str.append("]");
for (int i = state.Next(); i < state.NumTokens(); ++i) {
tensorflow::strings::StrAppend(&str, " ", state.GetToken(i).word());
}
return str;
}
};
class ArcStandardTransitionSystem : public ParserTransitionSystem {
public:
// Action types for the arc-standard transition system.
enum ParserActionType {
SHIFT = 0,
LEFT_ARC = 1,
RIGHT_ARC = 2,
};
// The SHIFT action uses the same value as the corresponding action type.
static ParserAction ShiftAction() { return SHIFT; }
// The LEFT_ARC action converts the label to an odd number greater than or
// equal to 1.
static ParserAction LeftArcAction(int label) { return 1 + (label << 1); }
// The RIGHT_ARC action converts the label to an even number greater than
// or equal to 2.
static ParserAction RightArcAction(int label) {
return 1 + ((label << 1) | 1);
}
// Extracts the action type from a given parser action.
static ParserActionType ActionType(ParserAction action) {
return static_cast<ParserActionType>(action < 1 ? action
: 1 + (~action & 1));
}
// Extracts the label from a given parser action. If the action is SHIFT,
// returns -1.
static int Label(ParserAction action) {
return action < 1 ? -1 : (action - 1) >> 1;
}
// Returns the number of action types.
int NumActionTypes() const override { return 3; }
// Returns the number of possible actions.
int NumActions(int num_labels) const override { return 1 + 2 * num_labels; }
// Returns the default action for a given state.
ParserAction GetDefaultAction(const ParserState &state) const override {
// If there are further tokens available in the input then Shift.
if (!state.EndOfInput()) return ShiftAction();
// Do a "reduce".
return RightArcAction(2);
}
// Returns the next gold action for a given state according to the
// underlying annotated sentence.
ParserAction GetNextGoldAction(const ParserState &state) const override {
// If the stack contains fewer than two tokens, the only valid parser
// action is shift.
if (state.StackSize() < 2) {
DCHECK(!state.EndOfInput());
return ShiftAction();
}
// If the second token on the stack is the head of the first one,
// return a right arc action.
if (state.GoldHead(state.Stack(0)) == state.Stack(1) &&
DoneChildrenRightOf(state, state.Stack(0))) {
const int gold_label = state.GoldLabel(state.Stack(0));
return RightArcAction(gold_label);
}
// If the first token on the stack is the head of the second one,
// return a left arc action.
if (state.GoldHead(state.Stack(1)) == state.Top()) {
const int gold_label = state.GoldLabel(state.Stack(1));
return LeftArcAction(gold_label);
}
// Otherwise, shift.
return ShiftAction();
}
// Determines if a token has any children to the right in the sentence.
// Arc standard is a bottom-up parsing method and has to finish all sub-trees
// first.
static bool DoneChildrenRightOf(const ParserState &state, int head) {
int index = state.Next();
int num_tokens = state.sentence().token_size();
while (index < num_tokens) {
// Check if the token at index is the child of head.
int actual_head = state.GoldHead(index);
if (actual_head == head) return false;
// If the head of the token at index is to the right of it there cannot be
// any children in-between, so we can skip forward to the head. Note this
// is only true for projective trees.
if (actual_head > index) {
index = actual_head;
} else {
++index;
}
}
return true;
}
// Checks if the action is allowed in a given parser state.
bool IsAllowedAction(ParserAction action,
const ParserState &state) const override {
switch (ActionType(action)) {
case SHIFT:
return IsAllowedShift(state);
case LEFT_ARC:
return IsAllowedLeftArc(state);
case RIGHT_ARC:
return IsAllowedRightArc(state);
}
return false;
}
// Returns true if a shift is allowed in the given parser state.
bool IsAllowedShift(const ParserState &state) const {
// We can shift if there are more input tokens.
return !state.EndOfInput();
}
// Returns true if a left-arc is allowed in the given parser state.
bool IsAllowedLeftArc(const ParserState &state) const {
// Left-arc requires more than two tokens on the stack: the bottom token
// is the root, and we do not want a left arc to the root.
return state.StackSize() > 2;
}
// Returns true if a right-arc is allowed in the given parser state.
bool IsAllowedRightArc(const ParserState &state) const {
// Right-arc requires two or more tokens on the stack.
return state.StackSize() > 1;
}
// Performs the specified action on a given parser state, without adding the
// action to the state's history.
void PerformActionWithoutHistory(ParserAction action,
ParserState *state) const override {
switch (ActionType(action)) {
case SHIFT:
PerformShift(state);
break;
case LEFT_ARC:
PerformLeftArc(state, Label(action));
break;
case RIGHT_ARC:
PerformRightArc(state, Label(action));
break;
}
}
// Makes a shift by pushing the next input token on the stack and moving to
// the next position.
void PerformShift(ParserState *state) const {
DCHECK(IsAllowedShift(*state));
state->Push(state->Next());
state->Advance();
}
// Makes a left-arc between the two top tokens on stack and pops the second
// token on stack.
void PerformLeftArc(ParserState *state, int label) const {
DCHECK(IsAllowedLeftArc(*state));
int s0 = state->Pop();
state->AddArc(state->Pop(), s0, label);
state->Push(s0);
}
// Makes a right-arc between the two top tokens on stack and pops the stack.
void PerformRightArc(ParserState *state, int label) const {
DCHECK(IsAllowedRightArc(*state));
int s0 = state->Pop();
int s1 = state->Pop();
state->AddArc(s0, s1, label);
state->Push(s1);
}
// The state is deterministic when the stack holds fewer than two tokens
// and input remains: shift is then the only valid action.
bool IsDeterministicState(const ParserState &state) const override {
return state.StackSize() < 2 && !state.EndOfInput();
}
// We are in a final state when we have reached the end of the input and
// only the root remains on the stack.
bool IsFinalState(const ParserState &state) const override {
return state.EndOfInput() && state.StackSize() < 2;
}
// Returns a string representation of a parser action.
string ActionAsString(ParserAction action,
const ParserState &state) const override {
switch (ActionType(action)) {
case SHIFT:
return "SHIFT";
case LEFT_ARC:
return "LEFT_ARC(" + state.LabelAsString(Label(action)) + ")";
case RIGHT_ARC:
return "RIGHT_ARC(" + state.LabelAsString(Label(action)) + ")";
}
return "UNKNOWN";
}
// Returns a new transition state to be used to enhance the parser state.
ParserTransitionState *NewTransitionState(bool training_mode) const override {
return new ArcStandardTransitionState();
}
};
REGISTER_TRANSITION_SYSTEM("arc-standard", ArcStandardTransitionSystem);
} // namespace syntaxnet
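The integer action encoding described at the top of this file can be exercised standalone. The sketch below mirrors the bit arithmetic of `ShiftAction`, `LeftArcAction`, `RightArcAction`, `ActionType`, and `Label`; the names here are local stand-ins, not the library's API.

```cpp
#include <cassert>

// Mirrors the arc-standard action encoding: SHIFT is 0, LEFT_ARC(label) is
// odd (1, 3, 5, ...), RIGHT_ARC(label) is even and >= 2 (2, 4, 6, ...).
enum MiniActionType { SHIFT = 0, LEFT_ARC = 1, RIGHT_ARC = 2 };

int LeftArc(int label) { return 1 + (label << 1); }         // odd
int RightArc(int label) { return 1 + ((label << 1) | 1); }  // even

// Recovers the action type: 0 stays SHIFT; otherwise the low bit decides
// between LEFT_ARC (odd) and RIGHT_ARC (even).
MiniActionType TypeOf(int action) {
  return static_cast<MiniActionType>(action < 1 ? action : 1 + (~action & 1));
}

// Recovers the label, or -1 for SHIFT.
int LabelOf(int action) { return action < 1 ? -1 : (action - 1) >> 1; }
```

Encoding and decoding round-trip: for any label `l`, `LabelOf(LeftArc(l))` and `LabelOf(RightArc(l))` both give back `l`, and `TypeOf` distinguishes the two arc directions from the parity of the action value.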
/* Copyright 2016 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include <memory>
#include <string>
#include <gmock/gmock.h>
#include "syntaxnet/utils.h"
#include "syntaxnet/parser_state.h"
#include "syntaxnet/parser_transitions.h"
#include "syntaxnet/populate_test_inputs.h"
#include "syntaxnet/sentence.pb.h"
#include "syntaxnet/task_context.h"
#include "syntaxnet/task_spec.pb.h"
#include "syntaxnet/term_frequency_map.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/test.h"
namespace syntaxnet {
class ArcStandardTransitionTest : public ::testing::Test {
public:
ArcStandardTransitionTest()
: transition_system_(ParserTransitionSystem::Create("arc-standard")) {}
protected:
// Creates a label map and a tag map for testing based on the given
// document and initializes the transition system appropriately.
void SetUpForDocument(const Sentence &document) {
input_label_map_ = context_.GetInput("label-map", "text", "");
transition_system_->Setup(&context_);
PopulateTestInputs::Defaults(document).Populate(&context_);
label_map_.Load(TaskContext::InputFile(*input_label_map_),
0 /* minimum frequency */,
-1 /* maximum number of terms */);
transition_system_->Init(&context_);
}
// Creates a cloned state from a sentence in order to test that cloning
// works correctly for the new parser states.
ParserState *NewClonedState(Sentence *sentence) {
ParserState state(sentence, transition_system_->NewTransitionState(
true /* training mode */),
&label_map_);
return state.Clone();
}
// Performs gold transitions and checks that the labels and heads recorded
// in the parser state match gold heads and labels.
void GoldParse(Sentence *sentence) {
ParserState *state = NewClonedState(sentence);
LOG(INFO) << "Initial parser state: " << state->ToString();
while (!transition_system_->IsFinalState(*state)) {
ParserAction action = transition_system_->GetNextGoldAction(*state);
EXPECT_TRUE(transition_system_->IsAllowedAction(action, *state));
LOG(INFO) << "Performing action: "
<< transition_system_->ActionAsString(action, *state);
transition_system_->PerformActionWithoutHistory(action, state);
LOG(INFO) << "Parser state: " << state->ToString();
}
for (int i = 0; i < sentence->token_size(); ++i) {
EXPECT_EQ(state->GoldLabel(i), state->Label(i));
EXPECT_EQ(state->GoldHead(i), state->Head(i));
}
delete state;
}
// Always takes the default action, and verifies that this leads to
// a final state through a sequence of allowed actions.
void DefaultParse(Sentence *sentence) {
ParserState *state = NewClonedState(sentence);
LOG(INFO) << "Initial parser state: " << state->ToString();
while (!transition_system_->IsFinalState(*state)) {
ParserAction action = transition_system_->GetDefaultAction(*state);
EXPECT_TRUE(transition_system_->IsAllowedAction(action, *state));
LOG(INFO) << "Performing action: "
<< transition_system_->ActionAsString(action, *state);
transition_system_->PerformActionWithoutHistory(action, state);
LOG(INFO) << "Parser state: " << state->ToString();
}
delete state;
}
TaskContext context_;
TaskInput *input_label_map_ = nullptr;
TermFrequencyMap label_map_;
std::unique_ptr<ParserTransitionSystem> transition_system_;
};
TEST_F(ArcStandardTransitionTest, SingleSentenceDocumentTest) {
string document_text;
Sentence document;
TF_CHECK_OK(ReadFileToString(
tensorflow::Env::Default(),
"syntaxnet/testdata/document",
&document_text));
LOG(INFO) << "Document:\n" << document_text;
CHECK(TextFormat::ParseFromString(document_text, &document));
SetUpForDocument(document);
GoldParse(&document);
DefaultParse(&document);
}
} // namespace syntaxnet
/* Copyright 2016 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#ifndef $TARGETDIR_BASE_H_
#define $TARGETDIR_BASE_H_
#include <functional>
#include <string>
#include <vector>
#include <unordered_map>
#include <unordered_set>
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/lib/strings/strcat.h"
#include "tensorflow/core/lib/strings/stringprintf.h"
#include "tensorflow/core/platform/default/integral_types.h"
#include "tensorflow/core/platform/mutex.h"
#include "tensorflow/core/platform/protobuf.h"
using tensorflow::int32;
using tensorflow::int64;
using tensorflow::uint64;
using tensorflow::uint32;
using tensorflow::protobuf::TextFormat;
using tensorflow::mutex_lock;
using tensorflow::mutex;
using std::map;
using std::pair;
using std::vector;
using std::unordered_map;
using std::unordered_set;
typedef signed int char32;
using tensorflow::StringPiece;
using std::string;
#endif // $TARGETDIR_BASE_H_
/* Copyright 2016 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include <algorithm>
#include <deque>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include "syntaxnet/base.h"
#include "syntaxnet/parser_state.h"
#include "syntaxnet/parser_transitions.h"
#include "syntaxnet/sentence_batch.h"
#include "syntaxnet/sentence.pb.h"
#include "syntaxnet/shared_store.h"
#include "syntaxnet/sparse.pb.h"
#include "syntaxnet/task_context.h"
#include "syntaxnet/task_spec.pb.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/lib/io/inputbuffer.h"
#include "tensorflow/core/platform/env.h"
using tensorflow::DEVICE_CPU;
using tensorflow::DT_BOOL;
using tensorflow::DT_FLOAT;
using tensorflow::DT_INT32;
using tensorflow::DT_INT64;
using tensorflow::DT_STRING;
using tensorflow::DataType;
using tensorflow::OpKernel;
using tensorflow::OpKernelConstruction;
using tensorflow::OpKernelContext;
using tensorflow::TTypes;
using tensorflow::Tensor;
using tensorflow::TensorShape;
using tensorflow::errors::FailedPrecondition;
using tensorflow::errors::InvalidArgument;
namespace syntaxnet {
// Wraps ParserState so that the history of transitions (the actions
// performed and the beam slot they were performed in) is recorded.
struct ParserStateWithHistory {
public:
// New state with an empty history.
explicit ParserStateWithHistory(const ParserState &s) : state(s.Clone()) {}
// New state obtained by cloning the given state and applying the given
// action. The given beam slot and action are appended to the history.
ParserStateWithHistory(const ParserStateWithHistory &next,
const ParserTransitionSystem &transitions, int32 slot,
int32 action, float score)
: state(next.state->Clone()),
slot_history(next.slot_history),
action_history(next.action_history),
score_history(next.score_history) {
transitions.PerformAction(action, state.get());
slot_history.push_back(slot);
action_history.push_back(action);
score_history.push_back(score);
}
std::unique_ptr<ParserState> state;
std::vector<int32> slot_history;
std::vector<int32> action_history;
std::vector<float> score_history;
private:
TF_DISALLOW_COPY_AND_ASSIGN(ParserStateWithHistory);
};
struct BatchStateOptions {
// Maximum number of parser states in a beam.
int max_beam_size;
// Number of parallel sentences to decode.
int batch_size;
// Argument prefix for context parameters.
string arg_prefix;
// Corpus name to read from the context inputs.
string corpus_name;
// Whether we allow weights in SparseFeatures protos.
bool allow_feature_weights;
// Whether beams should be considered alive until all states are final, or
// until the gold path falls off.
bool continue_until_all_final;
// Whether to skip to a new sentence after each training step.
bool always_start_new_sentences;
// Parameter for deciding which tokens to score.
string scoring_type;
};
// Encapsulates the environment needed to parse with a beam, keeping a
// record of path histories.
class BeamState {
public:
// The agenda is keyed by a tuple that is the score followed by an
// int that is -1 if the path coincides with the gold path and 0
// otherwise. The lexicographic ordering of the keys therefore
// ensures that for all paths sharing the same score, the gold path
// will always be at the bottom. This situation can occur at the
// onset of training when all weights are zero and therefore all
// paths have an identically zero score.
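// Illustrative note (not part of the original source): with two paths
// tied at score 0.0, the gold key {0.0, -1} compares less than the
// non-gold key {0.0, 0}, so the ascending multimap keeps the gold path
// in the bottom slot among ties:
//
//   AgendaType agenda;
//   agenda.emplace(KeyType(0.0, -1), MakeState());  // gold path
//   agenda.emplace(KeyType(0.0, 0), MakeState());   // non-gold path
//   // agenda.begin()->first.second == -1, i.e. gold is at the bottom.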
typedef std::pair<double, int> KeyType;
typedef std::multimap<KeyType, std::unique_ptr<ParserStateWithHistory>>
AgendaType;
typedef std::pair<const KeyType, std::unique_ptr<ParserStateWithHistory>>
AgendaItem;
typedef Eigen::Tensor<float, 2, Eigen::RowMajor, Eigen::DenseIndex>
ScoreMatrixType;
// The beam can be
// - ALIVE: parsing is still active, features are being output for at least
// some slots in the beam.
// - DYING: features should be output for this beam only one more time, then
// the beam will be DEAD. This state is reached when the gold path falls
// out of the beam and features have to be output one last time.
// - DEAD: parsing is not active, features are not being output, and no
//   actions are taken on the states.
enum State { ALIVE = 0, DYING = 1, DEAD = 2 };
explicit BeamState(const BatchStateOptions &options) : options_(options) {}
void Reset() {
if (options_.always_start_new_sentences ||
gold_ == nullptr || transition_system_->IsFinalState(*gold_)) {
AdvanceSentence();
}
slots_.clear();
if (gold_ == nullptr) {
state_ = DEAD; // EOF has been reached.
} else {
gold_->set_is_gold(true);
slots_.emplace(KeyType(0.0, -1), std::unique_ptr<ParserStateWithHistory>(
new ParserStateWithHistory(*gold_)));
state_ = ALIVE;
}
}
void UpdateAllFinal() {
all_final_ = true;
for (const AgendaItem &item : slots_) {
if (!transition_system_->IsFinalState(*item.second->state)) {
all_final_ = false;
break;
}
}
if (all_final_) {
state_ = DEAD;
}
}
// This method updates the beam. For all elements of the beam, all
// allowed transitions are scored and inserted into a new beam. The
// beam size is capped by discarding the lowest scoring slots at any
// given time. There is one exception to this process: the gold path
// is forced to remain in the beam at all times, even if it scores
// low. This is to ensure that the gold path can be used for
// training at the moment it would otherwise fall off (and be absent
// from) the beam.
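// Illustrative sketch (not part of the original source, and assuming
// continue_until_all_final is false): with max_beam_size = 2 and
// candidate scores {gold: 1.0, a: 3.0, b: 2.0}, the gold path sits at
// the bottom of the agenda, so PruneBeam() marks the beam DYING, skips
// the gold slot, and evicts b (the lowest-scoring non-gold slot),
// leaving {gold, a} on the beam.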
void Advance(const ScoreMatrixType &scores) {
// If the beam was in the state of DYING, it is now DEAD.
if (state_ == DYING) state_ = DEAD;
// When to stop advancing depends on the 'continue_until_all_final' arg.
if (!IsAlive() || gold_ == nullptr) return;
AdvanceGold();
const int score_rows = scores.dimension(0);
const int num_actions = scores.dimension(1);
// Advance beam.
AgendaType previous_slots;
previous_slots.swap(slots_);
CHECK_EQ(state_, ALIVE);
int slot = 0;
for (AgendaItem &item : previous_slots) {
{
ParserState *current = item.second->state.get();
VLOG(2) << "Slot: " << slot;
VLOG(2) << "Parser state: " << current->ToString();
VLOG(2) << "Parser state cumulative score: " << item.first.first << " "
<< (item.first.second < 0 ? "golden" : "");
}
if (!transition_system_->IsFinalState(*item.second->state)) {
// Not a final state.
for (int action = 0; action < num_actions; ++action) {
// Is action allowed?
if (!transition_system_->IsAllowedAction(action,
*item.second->state)) {
continue;
}
CHECK_LT(slot, score_rows);
MaybeInsertWithNewAction(item, slot, scores(slot, action), action);
PruneBeam();
}
} else {
// Final state: no need to advance.
MaybeInsert(&item);
PruneBeam();
}
++slot;
}
UpdateAllFinal();
}
void PopulateFeatureOutputs(
std::vector<std::vector<std::vector<SparseFeatures>>> *features) {
for (const AgendaItem &item : slots_) {
VLOG(2) << "State: " << item.second->state->ToString();
std::vector<std::vector<SparseFeatures>> f =
features_->ExtractSparseFeatures(*workspace_, *item.second->state);
for (size_t i = 0; i < f.size(); ++i) (*features)[i].push_back(f[i]);
}
}
int BeamSize() const { return slots_.size(); }
bool IsAlive() const { return state_ == ALIVE; }
bool IsDead() const { return state_ == DEAD; }
bool AllFinal() const { return all_final_; }
// The current contents of the beam.
AgendaType slots_;
// Index of this beam within the batch.
int beam_id_ = 0;
// Sentence batch reader.
SentenceBatch *sentence_batch_ = nullptr;
// Label map.
const TermFrequencyMap *label_map_ = nullptr;
// Transition system.
const ParserTransitionSystem *transition_system_ = nullptr;
// Feature extractor.
const ParserEmbeddingFeatureExtractor *features_ = nullptr;
// Feature workspace set.
WorkspaceSet *workspace_ = nullptr;
// Internal workspace registry for use in feature extraction.
WorkspaceRegistry *workspace_registry_ = nullptr;
// ParserState used to get gold actions.
std::unique_ptr<ParserState> gold_;
private:
// Creates a new ParserState if there's another sentence to be read.
void AdvanceSentence() {
gold_.reset();
if (sentence_batch_->AdvanceSentence(beam_id_)) {
gold_.reset(new ParserState(sentence_batch_->sentence(beam_id_),
transition_system_->NewTransitionState(true),
label_map_));
workspace_->Reset(*workspace_registry_);
features_->Preprocess(workspace_, gold_.get());
}
}
void AdvanceGold() {
gold_action_ = -1;
if (!transition_system_->IsFinalState(*gold_)) {
gold_action_ = transition_system_->GetNextGoldAction(*gold_);
if (transition_system_->IsAllowedAction(gold_action_, *gold_)) {
// In cases where the gold annotation is incompatible with the
// transition system, the action returned as gold might not be allowed.
transition_system_->PerformAction(gold_action_, gold_.get());
}
}
}
// Removes the first non-gold beam element if the beam is larger than
// the maximum beam size. If the gold element was at the bottom of the
// beam, sets the beam state to DYING, otherwise leaves the state alone.
void PruneBeam() {
if (static_cast<int>(slots_.size()) > options_.max_beam_size) {
auto bottom = slots_.begin();
if (!options_.continue_until_all_final &&
bottom->second->state->is_gold()) {
state_ = DYING;
++bottom;
}
slots_.erase(bottom);
}
}
// Inserts an item in the beam if
// - the item is gold,
// - the beam is not full, or
// - the item's new score is greater than the lowest score in the beam after
// the score has been incremented by given delta_score.
// Inserted items have slot, delta_score and action appended to their history.
void MaybeInsertWithNewAction(const AgendaItem &item, const int slot,
const double delta_score, const int action) {
const double score = item.first.first + delta_score;
const bool is_gold =
item.second->state->is_gold() && action == gold_action_;
if (is_gold || static_cast<int>(slots_.size()) < options_.max_beam_size ||
score > slots_.begin()->first.first) {
const KeyType key{score, -static_cast<int>(is_gold)};
slots_.emplace(key, std::unique_ptr<ParserStateWithHistory>(
new ParserStateWithHistory(
*item.second, *transition_system_, slot,
action, delta_score)))
->second->state->set_is_gold(is_gold);
}
}
// Inserts an item in the beam if
// - the item is gold,
// - the beam is not full, or
// - the item's new score is greater than the lowest score in the beam.
// The history of inserted items is left untouched.
void MaybeInsert(AgendaItem *item) {
const bool is_gold = item->second->state->is_gold();
const double score = item->first.first;
if (is_gold || static_cast<int>(slots_.size()) < options_.max_beam_size ||
score > slots_.begin()->first.first) {
slots_.emplace(item->first, std::move(item->second));
}
}
// Options, including the maximum number of slots on the beam.
const BatchStateOptions &options_;
int gold_action_ = -1;
State state_ = ALIVE;
bool all_final_ = false;
TF_DISALLOW_COPY_AND_ASSIGN(BeamState);
};
// Encapsulates the state of a batch of beams. It is an object of this
// type that will persist through repeated Op evaluations as the
// multiple steps are computed in sequence.
class BatchState {
public:
explicit BatchState(const BatchStateOptions &options)
: options_(options), features_(options.arg_prefix) {}
~BatchState() { SharedStore::Release(label_map_); }
void Init(TaskContext *task_context) {
// Create sentence batch.
sentence_batch_.reset(
new SentenceBatch(BatchSize(), options_.corpus_name));
sentence_batch_->Init(task_context);
// Create transition system.
transition_system_.reset(ParserTransitionSystem::Create(task_context->Get(
tensorflow::strings::StrCat(options_.arg_prefix, "_transition_system"),
"arc-standard")));
transition_system_->Setup(task_context);
transition_system_->Init(task_context);
// Create label map.
string label_map_path =
TaskContext::InputFile(*task_context->GetInput("label-map"));
label_map_ = SharedStoreUtils::GetWithDefaultName<TermFrequencyMap>(
label_map_path, 0, 0);
// Setup features.
features_.Setup(task_context);
features_.Init(task_context);
features_.RequestWorkspaces(&workspace_registry_);
// Create workspaces.
workspaces_.resize(BatchSize());
// Create beams.
beams_.clear();
for (int beam_id = 0; beam_id < BatchSize(); ++beam_id) {
beams_.emplace_back(options_);
beams_[beam_id].beam_id_ = beam_id;
beams_[beam_id].sentence_batch_ = sentence_batch_.get();
beams_[beam_id].transition_system_ = transition_system_.get();
beams_[beam_id].label_map_ = label_map_;
beams_[beam_id].features_ = &features_;
beams_[beam_id].workspace_ = &workspaces_[beam_id];
beams_[beam_id].workspace_registry_ = &workspace_registry_;
}
}
void ResetBeams() {
for (BeamState &beam : beams_) {
beam.Reset();
}
// If no sentences remain in the batch, rewind the corpus.
if (sentence_batch_->size() == 0) {
++epoch_;
VLOG(2) << "Starting epoch " << epoch_;
sentence_batch_->Rewind();
}
}
// Resets the offset vectors required for a single run because we're
// starting a new matrix of scores.
void ResetOffsets() {
beam_offsets_.clear();
step_offsets_ = {0};
UpdateOffsets();
}
void AdvanceBeam(const int beam_id,
const TTypes<float>::ConstMatrix &scores) {
const int offset = beam_offsets_.back()[beam_id];
Eigen::array<Eigen::DenseIndex, 2> offsets = {offset, 0};
Eigen::array<Eigen::DenseIndex, 2> extents = {
beam_offsets_.back()[beam_id + 1] - offset, NumActions()};
BeamState::ScoreMatrixType beam_scores = scores.slice(offsets, extents);
beams_[beam_id].Advance(beam_scores);
}
void UpdateOffsets() {
beam_offsets_.emplace_back(BatchSize() + 1, 0);
std::vector<int> &offsets = beam_offsets_.back();
for (int beam_id = 0; beam_id < BatchSize(); ++beam_id) {
// If the beam is ALIVE or DYING (but not DEAD), we want to
// output the activations.
const BeamState &beam = beams_[beam_id];
const int beam_size = beam.IsDead() ? 0 : beam.BeamSize();
offsets[beam_id + 1] = offsets[beam_id] + beam_size;
}
const int output_size = offsets.back();
step_offsets_.push_back(step_offsets_.back() + output_size);
}
tensorflow::Status PopulateFeatureOutputs(OpKernelContext *context) {
const int feature_size = FeatureSize();
std::vector<std::vector<std::vector<SparseFeatures>>> features(
feature_size);
for (int beam_id = 0; beam_id < BatchSize(); ++beam_id) {
if (!beams_[beam_id].IsDead()) {
beams_[beam_id].PopulateFeatureOutputs(&features);
}
}
CHECK_EQ(features.size(), feature_size);
Tensor *output;
const int total_slots = beam_offsets_.back().back();
for (int i = 0; i < feature_size; ++i) {
std::vector<std::vector<SparseFeatures>> &f = features[i];
CHECK_EQ(total_slots, f.size());
if (total_slots == 0) {
TF_RETURN_IF_ERROR(
context->allocate_output(i, TensorShape({0, 0}), &output));
} else {
const int size = f[0].size();
TF_RETURN_IF_ERROR(context->allocate_output(
i, TensorShape({total_slots, size}), &output));
for (int j = 0; j < total_slots; ++j) {
CHECK_EQ(size, f[j].size());
for (int k = 0; k < size; ++k) {
if (!options_.allow_feature_weights && f[j][k].weight_size() > 0) {
return FailedPrecondition(
"Feature weights are not allowed when allow_feature_weights "
"is set to false.");
}
output->matrix<string>()(j, k) = f[j][k].SerializeAsString();
}
}
}
}
return tensorflow::Status::OK();
}
// Returns the offset (i.e. row number) of a particular beam at a
// particular step in the final concatenated score matrix.
int GetOffset(const int step, const int beam_id) const {
return step_offsets_[step] + beam_offsets_[step][beam_id];
}
int FeatureSize() const { return features_.embedding_dims().size(); }
int NumActions() const {
return transition_system_->NumActions(label_map_->Size());
}
int BatchSize() const { return options_.batch_size; }
const BeamState &Beam(const int i) const { return beams_[i]; }
int Epoch() const { return epoch_; }
const string &ScoringType() const { return options_.scoring_type; }
private:
const BatchStateOptions options_;
// How many times the document source has been rewound.
int epoch_ = 0;
// Batch of sentences, and the corresponding parser states.
std::unique_ptr<SentenceBatch> sentence_batch_;
// Transition system.
std::unique_ptr<ParserTransitionSystem> transition_system_;
// Label map for the transition system.
const TermFrequencyMap *label_map_;
// Typed feature extractor for embeddings.
ParserEmbeddingFeatureExtractor features_;
// Batch: WorkspaceSet objects.
std::vector<WorkspaceSet> workspaces_;
// Internal workspace registry for use in feature extraction.
WorkspaceRegistry workspace_registry_;
std::deque<BeamState> beams_;
std::vector<std::vector<int>> beam_offsets_;
// Keeps track of the slot offset of each step.
std::vector<int> step_offsets_;
TF_DISALLOW_COPY_AND_ASSIGN(BatchState);
};
// Creates a BeamState and hooks it up with a parser. This Op needs to
// remain alive for the duration of the parse.
class BeamParseReader : public OpKernel {
public:
explicit BeamParseReader(OpKernelConstruction *context) : OpKernel(context) {
string file_path;
int feature_size;
BatchStateOptions options;
OP_REQUIRES_OK(context, context->GetAttr("task_context", &file_path));
OP_REQUIRES_OK(context, context->GetAttr("feature_size", &feature_size));
OP_REQUIRES_OK(context,
context->GetAttr("beam_size", &options.max_beam_size));
OP_REQUIRES_OK(context,
context->GetAttr("batch_size", &options.batch_size));
OP_REQUIRES_OK(context,
context->GetAttr("arg_prefix", &options.arg_prefix));
OP_REQUIRES_OK(context,
context->GetAttr("corpus_name", &options.corpus_name));
OP_REQUIRES_OK(context, context->GetAttr("allow_feature_weights",
&options.allow_feature_weights));
OP_REQUIRES_OK(context,
context->GetAttr("continue_until_all_final",
&options.continue_until_all_final));
OP_REQUIRES_OK(context,
context->GetAttr("always_start_new_sentences",
&options.always_start_new_sentences));
// Reads task context from file.
string data;
OP_REQUIRES_OK(context, ReadFileToString(tensorflow::Env::Default(),
file_path, &data));
TaskContext task_context;
OP_REQUIRES(context,
TextFormat::ParseFromString(data, task_context.mutable_spec()),
InvalidArgument("Could not parse task context at ", file_path));
OP_REQUIRES(
context, options.batch_size > 0,
InvalidArgument("Batch size ", options.batch_size, " too small."));
options.scoring_type = task_context.Get(
tensorflow::strings::StrCat(options.arg_prefix, "_scoring"), "");
// Create batch state.
batch_state_.reset(new BatchState(options));
batch_state_->Init(&task_context);
// Check number of feature groups matches the task context.
const int required_size = batch_state_->FeatureSize();
OP_REQUIRES(
context, feature_size == required_size,
InvalidArgument("Task context requires feature_size=", required_size));
// Set expected signature.
std::vector<DataType> output_types(feature_size, DT_STRING);
output_types.push_back(DT_INT64);
output_types.push_back(DT_INT32);
OP_REQUIRES_OK(context, context->MatchSignature({}, output_types));
}
void Compute(OpKernelContext *context) override {
mutex_lock lock(mu_);
// Write features.
batch_state_->ResetBeams();
batch_state_->ResetOffsets();
batch_state_->PopulateFeatureOutputs(context);
// Forward the beam state vector.
Tensor *output;
const int feature_size = batch_state_->FeatureSize();
OP_REQUIRES_OK(context, context->allocate_output(feature_size,
TensorShape({}), &output));
output->scalar<int64>()() = reinterpret_cast<int64>(batch_state_.get());
// Output number of epochs.
OP_REQUIRES_OK(context, context->allocate_output(feature_size + 1,
TensorShape({}), &output));
output->scalar<int32>()() = batch_state_->Epoch();
}
private:
// mutex to synchronize access to Compute.
mutex mu_;
// The object whose handle will be passed among the Ops.
std::unique_ptr<BatchState> batch_state_;
TF_DISALLOW_COPY_AND_ASSIGN(BeamParseReader);
};
REGISTER_KERNEL_BUILDER(Name("BeamParseReader").Device(DEVICE_CPU),
BeamParseReader);
// Updates the beam based on incoming scores and outputs new feature vectors
// based on the updated beam.
class BeamParser : public OpKernel {
public:
explicit BeamParser(OpKernelConstruction *context) : OpKernel(context) {
int feature_size;
OP_REQUIRES_OK(context, context->GetAttr("feature_size", &feature_size));
// Set expected signature.
std::vector<DataType> output_types(feature_size, DT_STRING);
output_types.push_back(DT_INT64);
output_types.push_back(DT_BOOL);
OP_REQUIRES_OK(context,
context->MatchSignature({DT_INT64, DT_FLOAT}, output_types));
}
void Compute(OpKernelContext *context) override {
BatchState *batch_state =
reinterpret_cast<BatchState *>(context->input(0).scalar<int64>()());
const TTypes<float>::ConstMatrix scores = context->input(1).matrix<float>();
VLOG(2) << "Scores: " << scores;
CHECK_EQ(scores.dimension(1), batch_state->NumActions());
// In AdvanceBeam we use beam_offsets_[beam_id] to determine the slice of
// scores that should be used for advancing, but beam_offsets_[beam_id] only
// exists for beams that have a sentence loaded.
const int batch_size = batch_state->BatchSize();
for (int beam_id = 0; beam_id < batch_size; ++beam_id) {
batch_state->AdvanceBeam(beam_id, scores);
}
batch_state->UpdateOffsets();
// Forward the beam state unmodified.
Tensor *output;
const int feature_size = batch_state->FeatureSize();
OP_REQUIRES_OK(context, context->allocate_output(feature_size,
TensorShape({}), &output));
output->scalar<int64>()() = context->input(0).scalar<int64>()();
// Output the new features of all the slots in all the beams.
OP_REQUIRES_OK(context, batch_state->PopulateFeatureOutputs(context));
// Output whether the beams are alive.
OP_REQUIRES_OK(
context, context->allocate_output(feature_size + 1,
TensorShape({batch_size}), &output));
for (int beam_id = 0; beam_id < batch_size; ++beam_id) {
output->vec<bool>()(beam_id) = batch_state->Beam(beam_id).IsAlive();
}
}
private:
TF_DISALLOW_COPY_AND_ASSIGN(BeamParser);
};
REGISTER_KERNEL_BUILDER(Name("BeamParser").Device(DEVICE_CPU), BeamParser);
// Extracts the paths for the elements of the current beams and returns
// indices into a scoring matrix that is assumed to have been
// constructed along with the beam search.
class BeamParserOutput : public OpKernel {
public:
explicit BeamParserOutput(OpKernelConstruction *context) : OpKernel(context) {
// Set expected signature.
OP_REQUIRES_OK(context,
context->MatchSignature(
{DT_INT64}, {DT_INT32, DT_INT32, DT_INT32, DT_FLOAT}));
}
void Compute(OpKernelContext *context) override {
BatchState *batch_state =
reinterpret_cast<BatchState *>(context->input(0).scalar<int64>()());
const int num_actions = batch_state->NumActions();
const int batch_size = batch_state->BatchSize();
// Vectors for output.
//
// Each step of each path in the batch gets its index computed and a
// unique path id assigned.
std::vector<int32> indices;
std::vector<int32> path_ids;
// Each unique path gets a batch id and a slot (in the beam)
// id. These are in effect the row and column of the final
// 'logits' matrix going to CrossEntropy.
std::vector<int32> beam_ids;
std::vector<int32> slot_ids;
// To compute the cross entropy we also need the slot id of the
// gold path, one per batch.
std::vector<int32> gold_slot(batch_size, -1);
// For good measure we also output the path scores as computed by
// the beam decoder, so it can be compared in tests with the path
// scores computed via the indices in TF. This has the same length
// as beam_ids and slot_ids.
std::vector<float> path_scores;
// The scores tensor has, conceptually, four dimensions: 1. number
// of steps, 2. batch size, 3. number of paths on the beam at that
// step, and 4. the number of actions scored. However this is not
// a true tensor since the size of the beam at each step may not
// be equal among all steps and among all batches. Only the batch
// size and the number of actions are fixed.
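// Illustrative note (not part of the original source): the flat index
// of the score for action a, taken at step s from slot k of beam b,
// follows the row-major layout built up by UpdateOffsets():
//
//   index = num_actions * (GetOffset(s, b) + k) + a;
//
// which is how `indices` is populated below.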
int path_id = 0;
for (int beam_id = 0; beam_id < batch_size; ++beam_id) {
// This occurs at the end of the corpus, when there aren't enough
// sentences to fill the batch.
if (batch_state->Beam(beam_id).gold_ == nullptr) continue;
// Populate the vectors that will index into the concatenated
// scores tensor.
int slot = 0;
for (const auto &item : batch_state->Beam(beam_id).slots_) {
beam_ids.push_back(beam_id);
slot_ids.push_back(slot);
path_scores.push_back(item.first.first);
VLOG(2) << "PATH SCORE @ beam_id:" << beam_id << " slot:" << slot
<< " : " << item.first.first << " " << item.first.second;
VLOG(2) << "SLOT HISTORY: "
<< utils::Join(item.second->slot_history, " ");
VLOG(2) << "SCORE HISTORY: "
<< utils::Join(item.second->score_history, " ");
VLOG(2) << "ACTION HISTORY: "
<< utils::Join(item.second->action_history, " ");
// Record where the gold path ended up.
if (item.second->state->is_gold()) {
CHECK_EQ(gold_slot[beam_id], -1);
gold_slot[beam_id] = slot;
}
for (size_t step = 0; step < item.second->slot_history.size(); ++step) {
const int step_beam_offset = batch_state->GetOffset(step, beam_id);
const int slot_index = item.second->slot_history[step];
const int action_index = item.second->action_history[step];
indices.push_back(num_actions * (step_beam_offset + slot_index) +
action_index);
path_ids.push_back(path_id);
}
++slot;
++path_id;
}
// One and only one path must be the gold one.
CHECK_GE(gold_slot[beam_id], 0);
}
const int num_ix_elements = indices.size();
Tensor *output;
OP_REQUIRES_OK(context, context->allocate_output(
0, TensorShape({2, num_ix_elements}), &output));
auto indices_and_path_ids = output->matrix<int32>();
for (size_t i = 0; i < indices.size(); ++i) {
indices_and_path_ids(0, i) = indices[i];
indices_and_path_ids(1, i) = path_ids[i];
}
const int num_path_elements = beam_ids.size();
OP_REQUIRES_OK(context,
context->allocate_output(
1, TensorShape({2, num_path_elements}), &output));
auto beam_and_slot_ids = output->matrix<int32>();
for (size_t i = 0; i < beam_ids.size(); ++i) {
beam_and_slot_ids(0, i) = beam_ids[i];
beam_and_slot_ids(1, i) = slot_ids[i];
}
OP_REQUIRES_OK(context, context->allocate_output(
2, TensorShape({batch_size}), &output));
std::copy(gold_slot.begin(), gold_slot.end(), output->vec<int32>().data());
OP_REQUIRES_OK(context, context->allocate_output(
3, TensorShape({num_path_elements}), &output));
std::copy(path_scores.begin(), path_scores.end(),
output->vec<float>().data());
}
private:
TF_DISALLOW_COPY_AND_ASSIGN(BeamParserOutput);
};
REGISTER_KERNEL_BUILDER(Name("BeamParserOutput").Device(DEVICE_CPU),
BeamParserOutput);
// Computes eval metrics for the best path in the input beams.
class BeamEvalOutput : public OpKernel {
public:
explicit BeamEvalOutput(OpKernelConstruction *context) : OpKernel(context) {
// Set expected signature.
OP_REQUIRES_OK(context,
context->MatchSignature({DT_INT64}, {DT_INT32, DT_STRING}));
}
void Compute(OpKernelContext *context) override {
int num_tokens = 0;
int num_correct = 0;
int all_final = 0;
BatchState *batch_state =
reinterpret_cast<BatchState *>(context->input(0).scalar<int64>()());
const int batch_size = batch_state->BatchSize();
vector<Sentence> documents;
for (int beam_id = 0; beam_id < batch_size; ++beam_id) {
if (batch_state->Beam(beam_id).gold_ != nullptr &&
batch_state->Beam(beam_id).AllFinal()) {
++all_final;
const auto &item = *batch_state->Beam(beam_id).slots_.rbegin();
ComputeTokenAccuracy(*item.second->state, batch_state->ScoringType(),
&num_tokens, &num_correct);
documents.push_back(item.second->state->sentence());
item.second->state->AddParseToDocument(&documents.back());
}
}
Tensor *output;
OP_REQUIRES_OK(context,
context->allocate_output(0, TensorShape({2}), &output));
auto eval_metrics = output->vec<int32>();
eval_metrics(0) = num_tokens;
eval_metrics(1) = num_correct;
const int output_size = documents.size();
OP_REQUIRES_OK(context, context->allocate_output(
1, TensorShape({output_size}), &output));
for (int i = 0; i < output_size; ++i) {
output->vec<string>()(i) = documents[i].SerializeAsString();
}
}
private:
// Tallies the number of scored tokens and the number of correct tokens
// for a given ParserState.
void ComputeTokenAccuracy(const ParserState &state,
const string &scoring_type,
int *num_tokens, int *num_correct) {
for (int i = 0; i < state.sentence().token_size(); ++i) {
const Token &token = state.GetToken(i);
if (utils::PunctuationUtil::ScoreToken(token.word(), token.tag(),
scoring_type)) {
++*num_tokens;
if (state.IsTokenCorrect(i)) ++*num_correct;
}
}
}
TF_DISALLOW_COPY_AND_ASSIGN(BeamEvalOutput);
};
REGISTER_KERNEL_BUILDER(Name("BeamEvalOutput").Device(DEVICE_CPU),
BeamEvalOutput);
} // namespace syntaxnet
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for beam_reader_ops."""
import os.path
import time
import tensorflow as tf
from tensorflow.python.framework import test_util
from tensorflow.python.platform import googletest
from tensorflow.python.platform import logging
from syntaxnet import structured_graph_builder
from syntaxnet.ops import gen_parser_ops
FLAGS = tf.app.flags.FLAGS
if not hasattr(FLAGS, 'test_srcdir'):
FLAGS.test_srcdir = ''
if not hasattr(FLAGS, 'test_tmpdir'):
FLAGS.test_tmpdir = tf.test.get_temp_dir()
class ParsingReaderOpsTest(test_util.TensorFlowTestCase):
def setUp(self):
# Creates a task context with the correct testing paths.
initial_task_context = os.path.join(
FLAGS.test_srcdir,
'syntaxnet/'
'testdata/context.pbtxt')
self._task_context = os.path.join(FLAGS.test_tmpdir, 'context.pbtxt')
with open(initial_task_context, 'r') as fin:
with open(self._task_context, 'w') as fout:
fout.write(fin.read().replace('SRCDIR', FLAGS.test_srcdir)
.replace('OUTPATH', FLAGS.test_tmpdir))
# Creates necessary term maps.
with self.test_session() as sess:
gen_parser_ops.lexicon_builder(task_context=self._task_context,
corpus_name='training-corpus').run()
self._num_features, self._num_feature_ids, _, self._num_actions = (
sess.run(gen_parser_ops.feature_size(task_context=self._task_context,
arg_prefix='brain_parser')))
def MakeGraph(self,
max_steps=10,
beam_size=2,
batch_size=1,
**kwargs):
"""Constructs a structured learning graph."""
assert max_steps > 0, 'Empty network not supported.'
logging.info('MakeGraph + %s', kwargs)
with self.test_session(graph=tf.Graph()) as sess:
feature_sizes, domain_sizes, embedding_dims, num_actions = sess.run(
gen_parser_ops.feature_size(task_context=self._task_context))
embedding_dims = [8, 8, 8]
hidden_layer_sizes = []
learning_rate = 0.01
builder = structured_graph_builder.StructuredGraphBuilder(
num_actions,
feature_sizes,
domain_sizes,
embedding_dims,
hidden_layer_sizes,
seed=1,
max_steps=max_steps,
beam_size=beam_size,
gate_gradients=True,
use_locking=True,
use_averaging=False,
check_parameters=False,
**kwargs)
builder.AddTraining(self._task_context,
batch_size,
learning_rate=learning_rate,
decay_steps=1000,
momentum=0.9,
corpus_name='training-corpus')
builder.AddEvaluation(self._task_context,
batch_size,
evaluation_max_steps=25,
corpus_name=None)
builder.training['inits'] = tf.group(*builder.inits.values(), name='inits')
return builder
def Train(self, **kwargs):
with self.test_session(graph=tf.Graph()) as sess:
max_steps = 3
batch_size = 3
beam_size = 3
builder = (
self.MakeGraph(
max_steps=max_steps, beam_size=beam_size,
batch_size=batch_size, **kwargs))
logging.info('params: %s', builder.params.keys())
logging.info('variables: %s', builder.variables.keys())
t = builder.training
sess.run(t['inits'])
costs = []
gold_slots = []
alive_steps_vector = []
every_n = 5
walltime = time.time()
for step in range(10):
if step > 0 and step % every_n == 0:
new_walltime = time.time()
logging.info(
'Step: %d <cost>: %f <gold_slot>: %f <alive_steps>: %f <iter '
'time>: %f ms',
step, sum(costs[-every_n:]) / float(every_n),
sum(gold_slots[-every_n:]) / float(every_n),
sum(alive_steps_vector[-every_n:]) / float(every_n),
1000 * (new_walltime - walltime) / float(every_n))
walltime = new_walltime
cost, gold_slot, alive_steps, _ = sess.run(
[t['cost'], t['gold_slot'], t['alive_steps'], t['train_op']])
costs.append(cost)
gold_slots.append(gold_slot.mean())
alive_steps_vector.append(alive_steps.mean())
if builder._only_train:
trainable_param_names = [
k for k in builder.params if k in builder._only_train]
else:
trainable_param_names = builder.params.keys()
if builder._use_averaging:
for v in trainable_param_names:
avg = builder.variables['%s_avg_var' % v].eval()
tf.assign(builder.params[v], avg).eval()
# Reset for pseudo eval.
costs = []
gold_slots = []
alive_steps_vector = []
for step in range(10):
cost, gold_slot, alive_steps = sess.run(
[t['cost'], t['gold_slot'], t['alive_steps']])
costs.append(cost)
gold_slots.append(gold_slot.mean())
alive_steps_vector.append(alive_steps.mean())
logging.info(
'Pseudo eval: <cost>: %f <gold_slot>: %f <alive_steps>: %f',
sum(costs[-every_n:]) / float(every_n),
sum(gold_slots[-every_n:]) / float(every_n),
sum(alive_steps_vector[-every_n:]) / float(every_n))
def PathScores(self, iterations, beam_size, max_steps, batch_size):
with self.test_session(graph=tf.Graph()) as sess:
t = self.MakeGraph(beam_size=beam_size, max_steps=max_steps,
batch_size=batch_size).training
sess.run(t['inits'])
all_path_scores = []
beam_path_scores = []
for i in range(iterations):
logging.info('run %d', i)
tensors = (
sess.run(
[t['alive_steps'], t['concat_scores'],
t['all_path_scores'], t['beam_path_scores'],
t['indices'], t['path_ids']]))
logging.info('alive for %s, all_path_scores and beam_path_scores, '
'indices and path_ids:'
'\n%s\n%s\n%s\n%s',
tensors[0], tensors[2], tensors[3], tensors[4], tensors[5])
logging.info('diff:\n%s', tensors[2] - tensors[3])
all_path_scores.append(tensors[2])
beam_path_scores.append(tensors[3])
return all_path_scores, beam_path_scores
def testParseUntilNotAlive(self):
"""Ensures that the 'alive' condition works in the Cond ops."""
with self.test_session(graph=tf.Graph()) as sess:
t = self.MakeGraph(batch_size=3, beam_size=2, max_steps=5).training
sess.run(t['inits'])
for i in range(5):
logging.info('run %d', i)
tf_alive = t['alive'].eval()
self.assertFalse(any(tf_alive))
def testParseMomentum(self):
"""Ensures that Momentum training can be done using the gradients."""
self.Train()
self.Train(model_cost='perceptron_loss')
self.Train(model_cost='perceptron_loss',
only_train='softmax_weight,softmax_bias', softmax_init=0)
self.Train(only_train='softmax_weight,softmax_bias', softmax_init=0)
def testPathScoresAgree(self):
"""Ensures that path scores computed in the beam are same in the net."""
all_path_scores, beam_path_scores = self.PathScores(
iterations=1, beam_size=130, max_steps=5, batch_size=1)
self.assertArrayNear(all_path_scores[0], beam_path_scores[0], 1e-6)
def testBatchPathScoresAgree(self):
"""Ensures that path scores computed in the beam are same in the net."""
all_path_scores, beam_path_scores = self.PathScores(
iterations=1, beam_size=130, max_steps=5, batch_size=22)
self.assertArrayNear(all_path_scores[0], beam_path_scores[0], 1e-6)
def testBatchOneStepPathScoresAgree(self):
"""Ensures that path scores computed in the beam are same in the net."""
all_path_scores, beam_path_scores = self.PathScores(
iterations=1, beam_size=130, max_steps=1, batch_size=22)
self.assertArrayNear(all_path_scores[0], beam_path_scores[0], 1e-6)
if __name__ == '__main__':
googletest.main()
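The training loop in `Train` above logs a trailing mean over the last `every_n` measurements rather than a running total. As a standalone sketch of that bookkeeping (toy numbers, not real costs from the graph):

```python
# Trailing-window mean, as used for the <cost>/<gold_slot>/<alive_steps> logging.
every_n = 5
costs = [4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.0]  # hypothetical per-step costs

# Average only the most recent every_n entries.
trailing_mean = sum(costs[-every_n:]) / float(every_n)
print(trailing_mean)  # mean of [3.0, 2.5, 2.0, 1.5, 1.0]
```

Because only the last window is averaged, the reported value tracks recent progress instead of being dominated by early, high-cost steps.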
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A program to generate ASCII trees from conll files."""
import collections
import asciitree
import tensorflow as tf
import syntaxnet.load_parser_ops
from tensorflow.python.platform import logging
from syntaxnet import sentence_pb2
from syntaxnet.ops import gen_parser_ops
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('task_context',
'syntaxnet/models/parsey_mcparseface/context.pbtxt',
'Path to a task context with inputs and parameters for '
'feature extractors.')
flags.DEFINE_string('corpus_name', 'stdin-conll',
'Name of the corpus input in the task context from '
'which documents are read.')
def to_dict(sentence):
"""Builds a dictionary representing the parse tree of a sentence.
Args:
sentence: Sentence protocol buffer to represent.
Returns:
Dictionary mapping tokens to children.
"""
token_str = ['%s %s %s' % (token.word, token.tag, token.label)
for token in sentence.token]
children = [[] for _ in sentence.token]
root = -1
for i in range(len(sentence.token)):
token = sentence.token[i]
if token.head == -1:
root = i
else:
children[token.head].append(i)
def _get_dict(i):
d = collections.OrderedDict()
for c in children[i]:
d[token_str[c]] = _get_dict(c)
return d
tree = collections.OrderedDict()
tree[token_str[root]] = _get_dict(root)
return tree
def main(unused_argv):
logging.set_verbosity(logging.INFO)
with tf.Session() as sess:
src = gen_parser_ops.document_source(batch_size=32,
corpus_name=FLAGS.corpus_name,
task_context=FLAGS.task_context)
sentence = sentence_pb2.Sentence()
while True:
documents, finished = sess.run(src)
logging.info('Read %d documents', len(documents))
for d in documents:
sentence.ParseFromString(d)
tr = asciitree.LeftAligned()
d = to_dict(sentence)
print 'Input: %s' % sentence.text
print 'Parse:'
print tr(d)
if finished:
break
if __name__ == '__main__':
tf.app.run()
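The tree construction in `to_dict` can be exercised without protocol buffers or the `asciitree` dependency. A minimal sketch, using a hypothetical token list of `(word, tag, label, head)` tuples where `head == -1` marks the root:

```python
import collections

# Toy stand-in for sentence.token; indices play the role of token positions.
tokens = [
    ('Parsey', 'NNP', 'nsubj', 1),
    ('parses', 'VBZ', 'ROOT', -1),
    ('text', 'NN', 'dobj', 1),
]

# Same bookkeeping as to_dict: a label string per token, a child list per head.
token_str = ['%s %s %s' % (w, t, l) for w, t, l, _ in tokens]
children = [[] for _ in tokens]
root = -1
for i, (_, _, _, head) in enumerate(tokens):
    if head == -1:
        root = i
    else:
        children[head].append(i)

def _get_dict(i):
    d = collections.OrderedDict()
    for c in children[i]:
        d[token_str[c]] = _get_dict(c)
    return d

tree = collections.OrderedDict()
tree[token_str[root]] = _get_dict(root)
print(tree)
```

The resulting nested `OrderedDict` maps the root token string to its dependents, which is exactly the shape `asciitree.LeftAligned` expects.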
Parameter {
name: 'brain_parser_embedding_dims'
value: '64;32;32'
}
Parameter {
name: 'brain_parser_features'
value: 'input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word;input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag;stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label'
}
Parameter {
name: 'brain_parser_embedding_names'
value: 'words;tags;labels'
}
Parameter {
name: 'brain_parser_scoring'
value: 'default'
}
Parameter {
name: 'brain_pos_transition_system'
value: 'tagger'
}
Parameter {
name: 'brain_pos_embedding_dims'
value: '64;4;8;8'
}
Parameter {
name: 'brain_pos_features'
value: 'stack(3).word stack(2).word stack(1).word stack.word input.word input(1).word input(2).word input(3).word;input.digit input.hyphen;stack.suffix(length=2) input.suffix(length=2) input(1).suffix(length=2);stack.prefix(length=2) input.prefix(length=2) input(1).prefix(length=2)'
}
Parameter {
name: 'brain_pos_embedding_names'
value: 'words;other;suffix;prefix'
}
input {
name: 'training-corpus'
record_format: 'conll-sentence'
Part {
file_pattern: '<your-dataset>/treebank-train.trees.conll'
}
}
input {
name: 'tuning-corpus'
record_format: 'conll-sentence'
Part {
file_pattern: '<your-dataset>/dev.conll'
}
}
input {
name: 'dev-corpus'
record_format: 'conll-sentence'
Part {
file_pattern: '<your-dataset>/test.conll'
}
}
input {
name: 'tagged-training-corpus'
creator: 'brain_pos/greedy'
record_format: 'conll-sentence'
}
input {
name: 'tagged-tuning-corpus'
creator: 'brain_pos/greedy'
record_format: 'conll-sentence'
}
input {
name: 'tagged-dev-corpus'
creator: 'brain_pos/greedy'
record_format: 'conll-sentence'
}
input {
name: 'label-map'
creator: 'brain_pos/greedy'
}
input {
name: 'word-map'
creator: 'brain_pos/greedy'
}
input {
name: 'lcword-map'
creator: 'brain_pos/greedy'
}
input {
name: 'tag-map'
creator: 'brain_pos/greedy'
}
input {
name: 'category-map'
creator: 'brain_pos/greedy'
}
input {
name: 'prefix-table'
creator: 'brain_pos/greedy'
}
input {
name: 'suffix-table'
creator: 'brain_pos/greedy'
}
input {
name: 'tag-to-category'
creator: 'brain_pos/greedy'
}
input {
name: 'projectivized-training-corpus'
creator: 'brain_parser/greedy'
record_format: 'conll-sentence'
}
input {
name: 'parsed-training-corpus'
creator: 'brain_parser/greedy'
record_format: 'conll-sentence'
}
input {
name: 'parsed-tuning-corpus'
creator: 'brain_parser/greedy'
record_format: 'conll-sentence'
}
input {
name: 'parsed-dev-corpus'
creator: 'brain_parser/greedy'
record_format: 'conll-sentence'
}
input {
name: 'beam-parsed-training-corpus'
creator: 'brain_parser/structured'
record_format: 'conll-sentence'
}
input {
name: 'beam-parsed-tuning-corpus'
creator: 'brain_parser/structured'
record_format: 'conll-sentence'
}
input {
name: 'beam-parsed-dev-corpus'
creator: 'brain_parser/structured'
record_format: 'conll-sentence'
}
input {
name: 'stdin'
record_format: 'english-text'
Part {
file_pattern: '-'
}
}
input {
name: 'stdin-conll'
record_format: 'conll-sentence'
Part {
file_pattern: '-'
}
}
input {
name: 'stdout-conll'
record_format: 'conll-sentence'
Part {
file_pattern: '-'
}
}
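In the `Parameter` entries above, semicolon-separated values describe parallel embedding channels: `brain_parser_embedding_dims` (`'64;32;32'`) lines up positionally with `brain_parser_embedding_names` (`'words;tags;labels'`). A small sketch of how such values pair up (the helper is illustrative, not part of SyntaxNet):

```python
# Pair per-channel embedding names with their dimensions, as encoded by the
# semicolon-separated Parameter values in context.pbtxt.
dims = '64;32;32'
names = 'words;tags;labels'

channels = dict(zip(names.split(';'), (int(d) for d in dims.split(';'))))
print(channels)  # {'words': 64, 'tags': 32, 'labels': 32}
```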
#!/bin/bash
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# A script that runs a tokenizer, a part-of-speech tagger and a dependency
# parser on an English text file, with one sentence per line.
#
# Example usage:
# echo "Parsey McParseface is my favorite parser!" | syntaxnet/demo.sh
# To run on a conll formatted file, add the --conll command line argument.
#
PARSER_EVAL=bazel-bin/syntaxnet/parser_eval
MODEL_DIR=syntaxnet/models/parsey_mcparseface
[[ "$1" == "--conll" ]] && INPUT_FORMAT=stdin-conll || INPUT_FORMAT=stdin
$PARSER_EVAL \
--input=$INPUT_FORMAT \
--output=stdout-conll \
--hidden_layer_sizes=64 \
--arg_prefix=brain_tagger \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/tagger-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr \
| \
$PARSER_EVAL \
--input=stdin-conll \
--output=stdout-conll \
--hidden_layer_sizes=512,512 \
--arg_prefix=brain_parser \
--graph_builder=structured \
--task_context=$MODEL_DIR/context.pbtxt \
--model_path=$MODEL_DIR/parser-params \
--slim_model \
--batch_size=1024 \
--alsologtostderr \
| \
bazel-bin/syntaxnet/conll2tree \
--task_context=$MODEL_DIR/context.pbtxt \
--alsologtostderr