This directory contains models for unsupervised training of word embeddings
using the model described in:

(Mikolov et al.) [Efficient Estimation of Word Representations in Vector Space](http://arxiv.org/abs/1301.3781),
ICLR 2013.

Detailed instructions on how to get started and use them are available in the
tutorials. Brief instructions are below.

* [Word2Vec Tutorial](http://tensorflow.org/tutorials/word2vec)

Assuming you have cloned the git repository, navigate into this directory. To download the example text and evaluation data:

```shell
curl http://mattmahoney.net/dc/text8.zip > text8.zip
unzip text8.zip
curl https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip > source-archive.zip
unzip -p source-archive.zip word2vec/trunk/questions-words.txt > questions-words.txt
rm text8.zip source-archive.zip
```
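
To sanity-check the downloaded evaluation data, a small script along these lines may help. It is only a sketch: the helper name `read_analogies` is hypothetical, and it assumes the usual `questions-words.txt` layout, where lines starting with `:` name a category and every other line holds four space-separated words. Words are lowercased here because `text8` contains only lowercase text.

```python
import io


def read_analogies(lines):
    """Parse analogy questions from an iterable of text lines.

    Lines starting with ':' are treated as category headers; every other
    non-empty line is expected to hold four whitespace-separated words.
    Returns a list of (category, (a, b, c, d)) tuples.
    """
    category = None
    questions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            category = line[1:].strip()
            continue
        words = line.lower().split()
        if len(words) != 4:
            continue  # skip malformed lines rather than fail
        questions.append((category, tuple(words)))
    return questions


# Tiny in-memory sample in the assumed format.
sample = io.StringIO(
    ": capital-common-countries\n"
    "Athens Greece Baghdad Iraq\n"
    "Athens Greece Bangkok Thailand\n"
)
print(read_analogies(sample))
```

To run it against the real file, pass an open file object instead of the sample, for example `read_analogies(open("questions-words.txt"))`.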

You will need to compile the ops as follows:

```shell
TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc -o word2vec_ops.so -fPIC -I $TF_INC -O2 -D_GLIBCXX_USE_CXX11_ABI=0
```

On Mac, add `-undefined dynamic_lookup` to the g++ command.

(For an explanation of what this command does, see the tutorial on [Adding a New Op to TensorFlow](https://www.tensorflow.org/how_tos/adding_an_op/#building_the_op_library). The flag `-D_GLIBCXX_USE_CXX11_ABI=0` is included to support newer versions of g++ (5 and later). If you compiled TensorFlow from source using g++ 5 or later, you may need to omit the flag.)
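
If you build on more than one platform, the compile command and its two optional flags can be assembled in a script. The following is a minimal sketch, not part of this directory: `build_compile_command` and its parameters are hypothetical names, and the include path shown is a placeholder for the value returned by `tf.sysconfig.get_include()`.

```python
import sys


def build_compile_command(tf_include, platform=sys.platform, old_abi=True):
    """Return the g++ argument list for building word2vec_ops.so.

    tf_include: TensorFlow include directory.
    platform:   value in the style of sys.platform ("darwin" on Mac).
    old_abi:    whether to pass -D_GLIBCXX_USE_CXX11_ABI=0.
    """
    cmd = [
        "g++", "-std=c++11", "-shared",
        "word2vec_ops.cc", "word2vec_kernels.cc",
        "-o", "word2vec_ops.so",
        "-fPIC", "-I", tf_include, "-O2",
    ]
    if old_abi:
        cmd.append("-D_GLIBCXX_USE_CXX11_ABI=0")
    if platform == "darwin":
        # Mac needs unresolved symbols to be looked up at load time.
        cmd += ["-undefined", "dynamic_lookup"]
    return cmd


# Placeholder include path; substitute the real one before running g++.
print(" ".join(build_compile_command("/path/to/tensorflow/include", platform="linux")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from this directory. Set `old_abi=False` if you compiled TensorFlow from source with g++ 5 or later and the flag causes problems.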

Then run using:

```shell
python word2vec_optimized.py \
  --train_data=text8 \
  --eval_data=questions-words.txt \
  --save_path=/tmp/
```
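
The evaluation data consists of word-analogy questions of the form "a is to b as c is to ?". As a rough illustration of how such a question is typically answered from embeddings, the sketch below picks the word whose vector is closest, by cosine similarity, to `b - a + c`. It is a pure-Python toy under stated assumptions: `predict_analogy` is a hypothetical helper, the vectors are made up, and it is not the code used by the scripts in this directory.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def predict_analogy(embeddings, a, b, c):
    """Return the word whose unit vector is closest (by cosine) to b - a + c.

    The three question words themselves are excluded from the candidates.
    """
    unit = {w: normalize(v) for w, v in embeddings.items()}
    target = normalize(
        [vb - va + vc for va, vb, vc in zip(unit[a], unit[b], unit[c])]
    )
    best_word, best_score = None, -2.0  # cosine is always >= -1
    for word, vec in unit.items():
        if word in (a, b, c):
            continue
        score = sum(x * y for x, y in zip(target, vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Made-up 2-d vectors, chosen only so the arithmetic is easy to follow.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 3.0],
    "apple": [-1.0, 0.5],
}
print(predict_analogy(toy, "man", "woman", "king"))  # prints "queen"
```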

Here is a short overview of what is in this directory.

File | What's in it?
--- | ---
`word2vec.py` | A version of word2vec implemented using TensorFlow ops and minibatching.
`word2vec_test.py` | Integration test for word2vec.
`word2vec_optimized.py` | A version of word2vec implemented using C++ ops that does no minibatching.
`word2vec_optimized_test.py` | Integration test for word2vec_optimized.
`word2vec_kernels.cc` | Kernels for the custom input and training ops.
`word2vec_ops.cc` | The declarations of the custom ops.
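
To make the minibatching distinction in the table concrete, here is a small sketch of how skip-gram style (center, context) training pairs can be generated and then grouped into batches. It assumes a fixed context window and no subsampling, which is simpler than what a real word2vec implementation does; `skipgram_pairs` and `minibatches` are hypothetical names, not functions from the files above.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def minibatches(pairs, batch_size):
    """Yield consecutive slices of `pairs`, each with at most `batch_size` items."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
print(list(minibatches(pairs, 4)))
```

A minibatched trainer would update the model once per batch, while a non-minibatched one would process the pairs one at a time.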