# Swivel in TensorFlow

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the
[Swivel algorithm](http://arxiv.org/abs/1602.02215) for generating word
embeddings.

Swivel works as follows:

1. Compute the co-occurrence statistics from a corpus; that is, determine how
   often a word *c* appears in the context (e.g., "within ten words") of a focus
   word *f*.  This results in a sparse *co-occurrence matrix* whose rows
   represent the focus words, and whose columns represent the context
   words. Each cell value is the number of times the focus and context words
   were observed together.
2. Re-organize the co-occurrence matrix and chop it into smaller pieces.
3. Assign a random *embedding vector* of fixed dimension (say, 300) to each
   focus word and to each context word.
4. Iteratively attempt to approximate the
   [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
   (PMI) between words with the dot product of the corresponding embedding
   vectors.

Note that the resulting co-occurrence matrix is very sparse (i.e., contains many
zeros) since most words won't have been observed in the context of other words.
In the case of very rare words, it seems reasonable to assume that you just
haven't sampled enough data to spot their co-occurrence yet.  On the other hand,
if we've failed to observe two common words co-occurring, it seems likely that
they are *anti-correlated*.

Swivel attempts to capture this intuition by using both the observed and the
unobserved co-occurrences to inform the way it iteratively adjusts vectors.
Empirically, this seems to lead to better embeddings, especially for rare words.
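
To make the objective concrete, the PMI target that the dot products are
trained to approximate can be computed directly from the co-occurrence counts.
The following NumPy sketch uses a tiny made-up count matrix purely for
illustration; it is not part of this release:

    import numpy as np

    # Toy co-occurrence counts: counts[i, j] is the number of times focus
    # word i was observed with context word j.  (Made-up numbers.)
    counts = np.array([[10., 2., 0.],
                       [ 2., 6., 1.],
                       [ 0., 1., 4.]])

    total = counts.sum()
    focus_totals = counts.sum(axis=1, keepdims=True)    # row marginals
    context_totals = counts.sum(axis=0, keepdims=True)  # column marginals

    # PMI(f, c) = log(P(f, c) / (P(f) * P(c))).  Zero-count cells come out
    # as -inf, which is why Swivel treats observed and unobserved cells
    # differently during training.
    with np.errstate(divide='ignore'):
        pmi = np.log(counts * total / (focus_totals * context_totals))

    # Training seeks row vectors R and column vectors C with R @ C.T ≈ pmi.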

# Contents

This release includes the following programs.

* `prep.py` is a program that takes a text corpus and pre-processes it for
  training. Specifically, it computes a vocabulary and token co-occurrence
  statistics for the corpus.  It then outputs the information into a format that
  can be digested by the TensorFlow trainer.
* `swivel.py` is a TensorFlow program that generates embeddings from the
  co-occurrence statistics.  It uses the files created by `prep.py` as input,
  and generates two text files as output: the row and column embeddings.
* `distributed.sh` is a Bash script that is meant to act as a template for
  launching "distributed" Swivel training; i.e., multiple processes that work in
  parallel and communicate via a parameter server.
* `text2bin.py` combines the row and column vectors generated by Swivel into a
  flat binary file that can be quickly loaded into memory to perform vector
  arithmetic.  This can also be used to convert embeddings from
  [Glove](http://nlp.stanford.edu/projects/glove/) and
  [word2vec](https://code.google.com/archive/p/word2vec/) into a form that can
  be used by the following tools.
* `nearest.py` is a program that you can use to manually inspect binary
  embeddings.
* `eval.mk` is a GNU makefile that will retrieve and normalize several common
  word similarity and analogy evaluation data sets.
* `wordsim.py` performs word similarity evaluation of the resulting vectors.
* `analogy` performs analogy evaluation of the resulting vectors.
* `fastprep` is a C++ program that works much more quickly than `prep.py`, but
  also has some additional dependencies to build.

# Building Embeddings with Swivel

To build your own word embeddings with Swivel, you'll need the following:

* A large corpus of text; for example, the
  [dump of English Wikipedia](https://dumps.wikimedia.org/enwiki/).
* A working [TensorFlow](http://www.tensorflow.org/) installation.
* A machine with plenty of disk space and, ideally, a beefy GPU card.  (We've
  experimented with the
  [Nvidia Titan X](http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x),
  for example.)

You'll then run `prep.py` (or `fastprep`) to prepare the data for Swivel and run
`swivel.py` to create the embeddings. The resulting embeddings will be output
into two large text files: one for the row vectors and one for the column
vectors.  You can use those "as is", or convert them into a binary file using
`text2bin.py` and then use the tools here to experiment with the resulting
vectors.

## Preparing the data for training

Once you've downloaded the corpus (e.g., to `/tmp/wiki.txt`), run `prep.py` to
prepare the data for training:

    ./prep.py --output_dir /tmp/swivel_data --input /tmp/wiki.txt

By default, `prep.py` will make one pass through the text file to compute a
"vocabulary" of the most frequent words, and then a second pass to compute the
co-occurrence statistics.  The following options allow you to control this
behavior:

| Option | Description |
|:--- |:--- |
| `--min_count <n>` | Only include words in the generated vocabulary that appear at least *n* times. |
| `--max_vocab <n>` | Admit at most *n* words into the vocabulary. |
| `--vocab <filename>` | Use the specified filename as the vocabulary instead of computing it from the corpus.  The file should contain one word per line. |

The `prep.py` program is pretty simple.  Notably, it does almost no text
processing: it does no case translation and simply breaks text into tokens by
splitting on spaces. Feel free to experiment with the `words` function if you'd
like to do something more sophisticated.
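
For example, a slightly more aggressive tokenizer might lower-case the text and
strip surrounding punctuation.  The snippet below is only a sketch of the kind
of replacement you might experiment with; the exact signature of `words` in
`prep.py` may differ:

    def words(line):
      """Splits a line into lower-cased tokens, stripping surrounding punctuation."""
      for token in line.lower().split():
        token = token.strip('.,!?;:"\'()[]')
        if token:
          yield token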

Unfortunately, `prep.py` is pretty slow.  Also included is `fastprep`, a C++
equivalent that works much more quickly.  Building `fastprep.cc` is a bit more
involved: it requires you to pull and build the TensorFlow source code in order
to provide the libraries and headers that it needs.  See `fastprep.mk` for more
details.

## Training the embeddings

When `prep.py` completes, it will have produced a directory containing the data
that the Swivel trainer needs to run.  Train embeddings as follows:

    ./swivel.py --input_base_path /tmp/swivel_data \
       --output_base_path /tmp/swivel_data

There are a variety of parameters that you can fiddle with to customize the
embeddings; some that you may want to experiment with include:

| Option | Description |
|:--- |:--- |
| `--embedding_size <dim>` | The dimensionality of the embeddings that are created.  By default, 300 dimensional embeddings are created. |
| `--num_epochs <n>` | The number of iterations through the data that are performed.  By default, 40 epochs are trained. |

As mentioned above, access to a beefy GPU will dramatically reduce the amount of
time it takes Swivel to train embeddings.

When complete, you should find `row_embedding.tsv` and `col_embedding.tsv` in
the directory specified by `--output_base_path`.  These are tab-delimited files
containing one embedding per line: the token, followed by *dim* floating point
numbers.
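
Because the format is just a token followed by its vector components, the text
files are easy to read back for ad-hoc experiments.  A minimal sketch, assuming
NumPy is available and the file fits comfortably in memory:

    import numpy as np

    def load_tsv_embeddings(filename):
      """Reads 'token<TAB>x1<TAB>...<TAB>xdim' lines into a {token: vector} dict."""
      embeddings = {}
      with open(filename) as fh:
        for line in fh:
          parts = line.rstrip('\n').split('\t')
          embeddings[parts[0]] = np.array([float(x) for x in parts[1:]])
      return embeddings

    vecs = load_tsv_embeddings('/tmp/swivel_data/row_embedding.tsv')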

## Exploring and evaluating the embeddings

There are also some simple tools you can use to explore the embeddings.  These tools
work with a simple binary vector format that can be `mmap`-ed into memory along
with a separate vocabulary file.  Use `text2bin.py` to generate these files:

    ./text2bin.py -o vecs.bin -v vocab.txt /tmp/swivel_data/*_embedding.tsv
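
If you want to poke at the binary vectors directly from Python, something along
the following lines should work.  Note that the exact on-disk layout (dtype and
ordering) is defined by `text2bin.py`; the sketch below *assumes* float32
vectors stored back-to-back in the same order as the tokens in `vocab.txt`, so
check `text2bin.py` before relying on it:

    import os
    import numpy as np

    # Assumption: vecs.bin is a flat array of float32 vectors, one per token
    # in vocab.txt, in the same order.  Verify against text2bin.py.
    vocab = [line.strip() for line in open('vocab.txt')]
    dim = os.path.getsize('vecs.bin') // (4 * len(vocab))
    vecs = np.memmap('vecs.bin', dtype=np.float32, mode='r',
                     shape=(len(vocab), dim))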

You can do some simple exploration using `nearest.py`:

    ./nearest.py -v vocab.txt -e vecs.bin
    query> dog
    dog
    dogs
    cat
    ...
    query> man woman king
    king
    queen
    princess
    ...

To evaluate the embeddings using common word similarity and analogy datasets,
use `eval.mk` to retrieve the data sets and build the tools:

    make -f eval.mk
    ./wordsim.py -v vocab.txt -e vecs.bin *.ws.tab
    ./analogy --vocab vocab.txt --embeddings vecs.bin *.an.tab

The word similarity evaluation compares the embeddings' estimate of "similarity"
with human judgement using
[Spearman's rho](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
as the measure of correlation.  (Bigger numbers are better.)
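
Concretely, for each word pair in an evaluation file the script needs a model
score to compare against the human rating; cosine similarity is the usual
choice.  The following is a rough sketch of that computation using SciPy, not a
description of `wordsim.py` itself:

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(a, b):
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def wordsim_rho(pairs, vecs):
      """pairs: (word1, word2, human_rating) tuples; vecs: {token: vector}."""
      model_scores, human_scores = [], []
      for w1, w2, rating in pairs:
        if w1 in vecs and w2 in vecs:
          model_scores.append(cosine(vecs[w1], vecs[w2]))
          human_scores.append(rating)
      rho, _ = spearmanr(model_scores, human_scores)
      return rho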

The analogy evaluation tests how well the embeddings can predict analogies like
"man is to woman as king is to queen".

Note that `eval.mk` forces all evaluation data into lower case.  From there,
both the word similarity and analogy evaluations assume that the eval data and
the embeddings use consistent capitalization: if you train embeddings using
mixed case and evaluate them using lower case, things won't work well.

# Contact

If you have any questions about Swivel, feel free to post to
[swivel-embeddings@googlegroups.com](https://groups.google.com/forum/#!forum/swivel-embeddings).